
ADM3308-Fall 2021: Business Data Mining

Assignment #2 (Group Work)


_____________________________________________________________________

Weight: 5% of the final mark.

Submission Instructions:
• Submit the report on BrightSpace® no later than 10 AM on Nov. 02, 2021.
• Only one submission per team. Please choose a team representative to submit one copy of the report on BrightSpace.

Statements on Group Contribution and Academic Integrity

When submitting your group work, the submission must include the following two
statements. Without the following two statements included in your submission, your
assignment will not be marked.

(a) Statement of Contributions


In one (or two) paragraphs, explain the contributions of each group member to the
assignment. Mention the name of the group member and the specific tasks (or
items) accomplished by that group member as their contribution to the
assignment. A submission without this Statement of Contributions will not be
marked.

(b) Academic Integrity Statement


Each individual member of the group must read the Academic Integrity
Statement, and type in his or her full name and student ID. The Academic
Integrity Statement must be e-signed by ALL members of the group UNLESS a
group member has not contributed to the assignment. A submission without the
signed Academic Integrity Statement will not be marked.

NOTE: If the above two statements are not included in your original submission, the assignment will not be marked, and the following deductions will be applied:

-20% if the statement was not included in the original submission but was submitted within 24 hours after being reminded by the TA.

-100% if the statement was not submitted within 24 hours after being reminded by the TA.

Important Note: Each individual member of the group must read the following academic integrity statement and type in their full name and student ID (which is considered the signature). The Academic Integrity Statement must be e-signed by ALL members of the group UNLESS a group member has not contributed to the assignment. Your assignment will not be marked if the following academic integrity statement is not submitted.


Personal Ethics & Academic Integrity Statement

Student name: Ketong Li Student ID: 300112336

Student Name: Zhenni Zhao Student ID: 300128622

Student Name: Rong Mu Student ID: 300127742

By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.

I attest to the fact that my own work in this project adheres to the fraud policies as
outlined in the Academic Regulations in the University’s Undergraduate Studies
Calendar. I further attest that I have knowledge of and have respected the “Beware of
Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the
best of my knowledge, I also believe that each of my group colleagues has also met the
aforementioned requirements and regulations. I understand that if my group assignment
is submitted without a completed copy of this Personal Work Statement from each group
member, it will be interpreted by the school that the missing student(s) name is
confirmation of non-participation of the aforementioned student(s) in the required work.

We, by typing in our names and student IDs on this form and submitting it electronically,
• warrant that the work submitted herein is our own group members’ work and not the work of others;
• acknowledge that we have read and understood the University Regulations on Academic Misconduct;
• acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work.

Statement of Contribution
Questions 1 and 6 were completed by Ketong Li (Student No. 300112336).
Questions 2 and 4 were completed by Zhenni Zhao (Student No. 300128622).
Questions 3 and 5 were completed by Rong Mu (Student No. 300127742).


Part I- Review Questions

Q1) [20 marks] List and explain different stages in customer life cycle, the business
process organized around the customer life cycle, and mention how data mining can help
in each process.
Customer life cycle
There are five stages in the customer life cycle.
1. Prospects: Identify the target market. At this stage, people in the target market are not yet customers; they are the people the model identifies as likely to be willing to purchase the products. People outside the target market are much less likely to buy.
2. Responders: These are people in the target market who show interest in the products. They are the company's intended customers, for example those who subscribe to the website or to specific products and may be willing to pay in the future.
3. New customers: These are responders from the previous stage who place orders or decide to buy the products.
4. Established customers: These are new customers who find the products reliable or well designed and are willing to reorder or to sign deals to purchase large quantities of the products.
5. Former customers: At this stage, existing customers have stopped purchasing and using the products. This may happen for different reasons, including voluntary attrition (e.g., moving to a competitor with a lower price), expected attrition (e.g., a change in taste or no longer working in the related field), and forced attrition (e.g., the customer could no longer afford the products) (Linoff & Berry, 2011).

Business process organized around the customer life cycle (data mining applications)
• Customer Acquisition
Customer acquisition means that the company uses strategies to attract potential customers and make them willing to buy its products, so that they eventually become customers. This is done by identifying the target market and using mail or email to send information and promotions to the targeted audience in the targeted areas. Because prospects change over time for multiple reasons, it is always helpful to redefine the target areas and audiences. The role of data mining here is to find the correct target market and audience and to communicate effectively with those prospects, for example by selecting the right tools at the appropriate time with an understandable and attractive message (Linoff & Berry, 2011).
• Customer Activation
Customer activation is the process by which the intended audience subscribes to or registers for the website or products. These people want more detailed information about specific products, and activation is considered a business move. There are four steps involved in this process:


1. The Sale (Leads): Members of the target audience who want to learn more about a specific product or subscription give out their personal information for future purchase use.
2. The Order: The prospect has successfully become a subscribed member of the website and their information has been verified.
3. The Subscription: Further verification is needed before the final purchase.
4. The Paid Subscription: The prospect has become a customer and their order for the product or subscription has been placed.

Data mining in these steps involves identifying whether the customer stays in the activation stage and finding the reasons customers leave, such as processes that are too complicated, products or subscriptions that are not attractive enough, or customers simply changing their minds. By analyzing this information, the company can learn more and improve its performance.

• Customer Relationship Management
There are four activities that strengthen the customer relationship while generating profit:
1. Up-selling: when customers make a purchase, recommend higher-priced or higher-quality products to improve their experience.
2. Cross-selling: when customers purchase one product, recommend another related product to enhance their experience. For example, when someone buys a PS5, recommend a new game they may be interested in, such as FIFA 22.
3. Customer value calculation: estimate the value of each customer, such as the price range the customer is willing to pay and whether the customer is likely to come back and make another purchase.
4. Usage stimulation: ask customers to subscribe to the company's website to receive further information about more products, which increases the chance that they come back.
Data mining applications for customer relationship management include making appropriate recommendations to each customer based on their age, income, preferences, and area; identifying different target customers, dividing them into groups, and designing for each group the specific marketing campaign most likely to attract them; and predicting whether the customer can afford the products or subscriptions, in order to decrease risk and maximize profit (Linoff & Berry, 2011).

• Retention
It is always better to keep customers loyal to the company and to build strong relationships with them, so it is important to identify customer attrition; effective retention campaigns are necessary.
Regarding the application of data mining, by building a model of attrition the company can find out which customers will leave and why, who will stay, and how long they will stay, in order to plan better for developing those relationships. Models used for retention include classification methods, neural networks, and decision trees (Linoff & Berry, 2011).


Q2) [10 marks] What is “overfitting”? What is its drawback? What approaches can be taken to avoid “overfitting”?

Overfitting occurs when a model uses too many parameters to fit a specific dataset. If the model is too complicated, it fits the available data perfectly but fails to generalize its predictive ability to other datasets. This happens when the model matches the training data so precisely that it explains them well yet cannot adapt to additional, relevant data, and it loses the ability to predict future patterns and observations. In other words, the model merely memorizes the data instead of recognizing general patterns and predicting further observations (Nelson, 2020).
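As a rough illustration of this idea (not part of the assignment deliverables), the following sketch fits a simple and an overly flexible polynomial model to the same noisy sample, assuming NumPy and scikit-learn are available; the data and polynomial degrees are invented for illustration. The flexible model reaches a much lower training error but a higher error on unseen data, which is exactly the drawback of overfitting.

```python
# Illustrative sketch of overfitting (assumes NumPy and scikit-learn are installed).
# A very flexible model memorizes the training sample but fails on unseen data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)    # noisy training sample
x_new = rng.uniform(0, 1, 30).reshape(-1, 1)                  # unseen data
y_new = np.sin(2 * np.pi * x_new).ravel() + rng.normal(0, 0.2, 30)

for degree in (3, 15):                                         # simple vs. overly complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree,
          mean_squared_error(y, model.predict(x)),             # training error (tiny for degree 15)
          mean_squared_error(y_new, model.predict(x_new)))     # test error (large for degree 15)
```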

One method to avoid overfitting is to have more data available to train the model. By adding more data, the model's algorithm has access to data from different sources and does not simply imitate and memorize one specific dataset, which increases accuracy. The more data available to the model, the less likely it is to overfit the sample and the better the patterns and results it produces (Corporate Finance Institute, n.d.).

Another method to prevent overfitting is cross-validation, in which the sample is split into two parts: one part contains most of the sample and the other a small proportion. The large part is used to build the model, while the small part is used to predict and forecast as well as to record the errors (Corporate Finance Institute, n.d.).
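As a hedged sketch of this idea (assuming Python with scikit-learn and its bundled Iris sample data; the choice of classifier is arbitrary), the split described above can be expressed as a hold-out split, and full k-fold cross-validation simply repeats that split so every record is held out once:

```python
# Sketch of the hold-out split and k-fold cross-validation (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Large part builds the model; the small part estimates the error on unseen records.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# k-fold cross-validation repeats the split k times so every record is tested once.
print("5-fold accuracies:", cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5))
```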

Data simplification is another effective way to avoid overfitting: it simplifies a complex model, which has a greater chance of overfitting the dataset, into a more straightforward model. For example, pruning a decision tree increases the stability of the model and effectively avoids overfitting. Many other methods can also reduce overfitting, including the Akaike information criterion, model comparison, and stopping model training before it starts to overfit (Corporate Finance Institute, n.d.).
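A minimal sketch of simplification by pruning a decision tree, assuming scikit-learn is available; the dataset and the pruning parameters (max_depth, ccp_alpha) are illustrative values, not prescribed ones:

```python
# Sketch of model simplification by pruning a decision tree (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # grown until pure
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,             # pre- and post-pruning
                                random_state=0).fit(X_tr, y_tr)

# The pruned (simpler) tree typically generalizes better to the held-out data.
print("full tree  :", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned tree:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```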


Q3) [10 marks] Aside from the Gini index and Entropy, explain two other parameters to
measure purity (best split) in a decision tree.

• Chi-Square
The formula used for Chi-Square is
Chi-Square = sqrt( (Actual − Expected)² / Expected ),
which measures the difference in value between the parent and child nodes. "Actual" is the observed frequency and "Expected" is the expected frequency of the class. Similarly, Chi-Square is only useful with categorical variables, not continuous ones. If there is no difference after the split (observed and expected are the same), the result is zero; a high Chi-Square value means the distribution in the child node differs more from the parent node, so the split produces higher purity. The steps are: find the expected value for each class, apply the formula with the actual value of each class, add up the values for each candidate split to get its total, and choose the split with the higher total Chi-Square (Singh, 2021).
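The calculation can be sketched in a few lines of Python (standard library only; the class counts are invented, and the expected counts assume a balanced parent node for simplicity):

```python
# Illustrative sketch of the Chi-Square split measure described above.
from math import sqrt

def chi_square_for_split(child_counts, parent_ratio=0.5):
    """child_counts: list of (positives, negatives) per child node.
    Expected counts assume each child keeps the parent's class ratio."""
    total = 0.0
    for pos, neg in child_counts:
        n = pos + neg
        for actual in (pos, neg):
            expected = n * parent_ratio                 # expected frequency for this class
            total += sqrt((actual - expected) ** 2 / expected)
        # a larger contribution means the child differs more from the parent
    return total

# Two candidate splits of a balanced parent node; pick the one with the larger value.
print(chi_square_for_split([(8, 2), (2, 8)]))   # fairly pure children -> larger Chi-Square
print(chi_square_for_split([(5, 5), (5, 5)]))   # same distribution as parent -> 0
```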

• Information Gain
Information Gain is used to decide splits when the target variable is categorical. It is based on the idea of entropy and is given by:
Information Gain = 1 − Entropy
Entropy measures the purity of a node: the lower the entropy, the higher the purity, and a fully homogeneous node has zero entropy. Nodes with higher purity therefore have higher information gain, and the maximum value is 1 because the entropy is subtracted from 1.
The entropy itself is computed as:
Entropy = −Σ (from i = 1 to n) p_i · log₂(p_i)
Using information gain to split a decision tree follows these steps (a worked sketch follows after the list):
1. Calculate the entropy of each sub-node separately for each candidate split.
2. Calculate the entropy of each split as the weighted average entropy of its sub-nodes.
3. Select the split with the lowest entropy value, i.e., the one that provides the most information gain.
4. Repeat steps 1-3 until the nodes are homogeneous (Sharma, 2020).
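Below is a small Python sketch of these steps (standard library only; the class counts are invented). Because the parent node in the example is balanced, its entropy is 1, so the gain equals 1 − (weighted child entropy), matching the formula above:

```python
# Sketch of the entropy and weighted-average-entropy calculation used to pick a split.
from math import log2

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) over the classes in a node."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def split_entropy(children):
    """Weighted average entropy of the child nodes produced by a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = [10, 10]                    # 10 positive and 10 negative records -> entropy = 1
split_a = [[9, 1], [1, 9]]           # nearly pure children
split_b = [[6, 4], [4, 6]]           # mixed children

for name, split in (("A", split_a), ("B", split_b)):
    gain = entropy(parent) - split_entropy(split)          # here equal to 1 - split entropy
    print(name, round(split_entropy(split), 3), round(gain, 3))
# Split A has the lower weighted entropy, i.e. the higher information gain, so it is chosen.
```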



Q4) [10 marks] What do the x-axis and y-axis represent in an ROC curve? What does the
area under the curve represent? Draw an example ROC curve for a classifier which
performs better than a random guess. Describe an application where false negative might
matter more than the accuracy of the classifier?

The ROC curve shows the relative tradeoff between the true positive rate and the false positive rate.

The x-axis: False Positive Rate = FP / (FP + TN) = 1 − Specificity
The y-axis: True Positive Rate = TP / (TP + FN) = Sensitivity

The area under the ROC curve can be used as a measure of test performance. A larger area means a more useful test: when the area is close to 1, the classifier performs best; when the area is close to 0.5, the classifier performs poorly (similar to a random guess); and when the area is less than 0.5, the classifier performs substantially worse than a random guess.

[Figure: example ROC curve. The red curve, bowing above the diagonal, represents a classifier that performs better than a random guess.]
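A hedged sketch of how such a curve could be produced, assuming scikit-learn and matplotlib are available; the synthetic dataset and the logistic regression classifier are purely illustrative choices:

```python
# Sketch of plotting an ROC curve for a classifier that beats a random guess.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)                  # x-axis: FPR, y-axis: TPR

plt.plot(fpr, tpr, color="red", label=f"classifier (AUC = {roc_auc_score(y_te, scores):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```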

Regarding breathalyzer tests, a false positive would indicate that a driver is over the alcohol limit when they have not even touched an alcoholic drink; a false negative would register a driver as sober when they are drunk, or at least over the limit. Here a false negative matters more, because allowing drunk drivers to keep driving while assuming they are sober is dangerous both to them and to others around them. In contrast, to deal with a false positive, drivers can provide a blood sample to prove their innocence. Therefore, the false negative causes the bigger problem and matters more than the overall accuracy of the classifier (Valchanov, 2018).

Part II- Problems

Q5) [10 marks] The following confusion matrix is produced during the test phase of a
classifier.

Actual \ Classified    Positive    Negative
Positive                     85          10
Negative                      5          20

Calculate the following measures for this classifier:


• Accuracy (also called Correctness)
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (85 + 20) / (85 + 20 + 5 + 10) = 0.875

• Specificity
Specificity = TN / (FP + TN) = 20 / (5 + 20) = 0.8

• Sensitivity
Sensitivity = TP / (TP + FN) = 85 / (85 + 10) = 0.895

• Precision
Precision = TP / (TP + FP) = 85 / (85 + 5) = 0.944

• F-Measure
F-Measure = (2 × Precision × Recall) / (Precision + Recall) = (2 × 0.944 × 0.895) / (0.944 + 0.895) = 0.919
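As an optional cross-check of the hand calculations above (assuming scikit-learn is available), the confusion matrix can be rebuilt from its counts and the same measures recomputed:

```python
# Optional check of the hand calculations (assumes scikit-learn is installed).
# The confusion matrix is rebuilt from the counts: TP=85, FN=10, FP=5, TN=20.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1] * 95 + [0] * 25                        # 95 actual positives, 25 actual negatives
y_pred = [1] * 85 + [0] * 10 + [1] * 5 + [0] * 20   # predictions matching the matrix

print("accuracy   :", accuracy_score(y_true, y_pred))               # 0.875
print("sensitivity:", round(recall_score(y_true, y_pred), 3))       # 0.895
print("specificity:", 20 / (20 + 5))                                # 0.8 (computed directly)
print("precision  :", round(precision_score(y_true, y_pred), 3))    # 0.944
print("F-measure  :", round(f1_score(y_true, y_pred), 3))           # 0.919
```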


Q6) [30 marks] Pre-processing using IBM Modeler: Use the dataset posted with this
assignment on the “Assignments” page on BrightSpace called Unclean-Bank-Data.Xlsx.
This dataset includes missing values, invalid values, and outliers.

Use the IBM SPSS Modeler to pre-process and clean the data. Note: Using Excel or other applications is not acceptable. This question must be done with IBM Modeler only.

Do not remove a record if there is only one missing value in that record. Instead, use the
IBM Modeler to fill in the missing value with an algorithm of your choice.

Similarly, do not remove a record if it has only one invalid value in that record. Instead,
use the IBM Modeler to fill in the invalid value with an algorithm of your choice.

If you find a record with more than one missing value, or more than one invalid value,
then you may either remove the record, or use the IBM Modeler to fill in for the missing
or invalid values.

If you detect outliers, you must record and report them, then delete the entire record.

You may also want to do other pre-processing tasks such as data normalization,
binning data, etc.

Deliverables for this Question:


1- Briefly explain the three different cleaning/ pre-processing tasks you applied on
the data using the IBM SPSS Modeler.
The first step in cleaning and pre-processing the dataset is removing the records that contain more than one invalid or missing value from the Excel sheet. Three records have more than one invalid or missing value: rows 579, 164, and 135.

Three different cleaning/pre-processing tasks:


Reclassify: Fields that should be categorical variables, including gender, car, save_act, cheq_act, mortgage, and approved, have only two valid values (e.g., Male/Female, Yes/No) but contain some values in non-standard formats, such as M and F for Male and Female, or Y and N for Yes and No. In this case, Reclassify puts those non-standard records into the standard format. By selecting Multiple mode and reclassifying into the existing fields, the Modeler changes the original (non-standard) values into the new (standard) values and reorganizes the data. After this, gender has only two unique values, while the other fields may have three, including the blank fields.


Type: The Type node sets the proper range for specific fields. In this case, three fields need the Type process: age, income, and children. These fields contain unrealistic outliers, such as a negative age (-18), 22 children (a data-entry error), and extreme incomes (annual income of -1 and of 10 million). Under Missing values, select Specify, set the proper range for age, income, and children, and select Nullify under Check values; then check the Define blanks box. For the fields save_act, cheq_act, and mortgage, which contain blank values, select Specify under the Missing values tab and click Define blanks. Connect the Type node to a Data Audit node, open the Quality tab, select Blank and Null Values under Impute Missing and Algorithm under Method, then click Generate and select Missing Values SuperNode. After clicking OK, a star labelled Missing Value Imputation appears and all the blanks are filled in automatically.

Binning: The last step is to clean the income field by dividing it into 5 bins. Select income under Bin fields, choose 5 as the number of bins, and click OK; a new field, Income_Bin, appears in the Data Audit and can be used for future analysis.

2- Include with your submission your clean dataset (name it “Clean-Bank-Data.xlsx”). You may submit one zip file including all your deliverables.


References

Corporate Finance Institute (CFI). (n.d.). Overfitting. https://corporatefinanceinstitute.com/resources/knowledge/other/overfitting/

Linoff, G., & Berry, M. (2011). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. John Wiley & Sons Inc.

Nelson, D. (2020, August 23). What is Overfitting? Unite.AI. https://www.unite.ai/what-is-overfitting/

Sharma, A. (2020, June 30). Decision Tree Split Methods | Decision Tree Machine Learning. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/

Singh, H. (2021, March 25). How to select Best Split in Decision Trees using Chi-Square. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/03/how-to-select-best-split-in-decision-trees-using-chi-square/

Valchanov, I. (2018, June 6). False Positive and False Negative. Medium. https://towardsdatascience.com/false-positive-and-false-negative-b29df2c60aca
