Improving Association Rules
DECLARATION
23 February 2015
KHOO SEE JUN
SN089817
APPROVAL SHEET
Signature: .
Date:
ABSTRACT
The objective of this project is to develop data mining algorithms for analysing
public domain databases. The public domain chosen for this project is the shopping
centre. My task is to identify and perform an association rule mining task, which
involves selecting an appropriate data set, preparing and preprocessing the data,
finding rules (including appropriate parameter setting), determining which of the
resulting rules are interesting, and working out how the interesting rules could be
useful.
This work analyses well-known DM techniques in the Weka workbench and reports the
simulation results obtained by applying four selected DM techniques and classifiers
in the open source workbench to sample data for Customer Relationship Management
(CRM) in a shopping centre.
The design of the data mining process is presented in Chapter 4, which shows the
data mining workflow.
The next part is the implementation. It begins with the data mining process
(methodology) and then proceeds to building model sets for prediction. The modified
code is then compiled using Apache Ant so that it can be used by WEKA.
Lastly, the best rules are generated by importing the dataset into WEKA, and the run
time of the original code is compared with that of the modified code, each in its own
WEKA build. The results show that the run time improved with the modified code.
TABLE OF CONTENTS
DECLARATION
APPROVAL SHEET
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
1.0 INTRODUCTION
1.1 Project Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Benefits
1.5 Project Scope
1.6 Expected Outcome
1.7 Gantt Chart
2.1.1 Association
2.1.2 Classification
2.1.3 Prediction
2.1.5 Clustering
2.1.7 Decision Table
2.2 Customer Relationship Management (CRM)
2.3.5 Medical/Pharma
CHAPTER 3: ANALYSIS
3.0 ANALYSIS
3.1 Data Mining for Shopping Centers
3.1.1 Free Sample Data for Testing Purpose
3.1.2 Related Work
3.1.3 Methods
3.1.4 Result and Discussion
3.1.5 Comparison between Naive Bayes (NB), Decision Table (DT) and Decision Tree (J48)
3.1.6 Comparison between classifiers with time taken to build a model
3.2 Association Rules Apriori Algorithm
3.2.1 Apriori Algorithm
3.2.2 Limitations of Apriori Algorithm
CHAPTER 4: DESIGN
4.0 DESIGN
4.1 Data Mining Process
4.1.1 Step One: Translate the business problem into a data mining problem
CHAPTER 5: IMPLEMENTATION
5.0 IMPLEMENTATION
5.1 Data Mining Process
CHAPTER 6: CONCLUSION
6.0 CONCLUSION
6.1 Progress and Outcome
REFERENCES
APPENDIX
LIST OF TABLES
TABLE 3.1
TABLE 3.2
LIST OF FIGURES
Figure 4.5 Data from the past mimics data from the past, present, and future
Figure 4.6 Sample data
CHAPTER 1
1.0 INTRODUCTION
data warehouses. Businesses can learn more about their customers and develop
more effective marketing strategies as well as increase sales and decrease costs.
Grocery stores are well-known users of data mining techniques. Many supermarkets
offer free loyalty cards to customers that give them access to reduced prices not
available to non-members. The cards make it easy for stores to track who is buying
what, when they are buying it, and at what price. The stores can then use this data,
after analyzing it, for multiple purposes, such as offering customers coupons that are
targeted to their buying habits and deciding when to put items on sale and when to sell
them at full price. Data mining tools predict behaviours and future trends, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may
miss because it lies outside their expectations.
aim at a company to create powerful strategies, make fast and feasible decisions and
achieve competitive advantage in future.
1.3 Objectives
The objectives of this project are:
1. To identify parameters of the algorithm
2. To design new data mining algorithms
3. To develop new data mining algorithms
1.4 Benefits
The benefits of data mining in businesses are:
1. More Money. Money is always a good thing in business. When mined data
unearths the kinds of projects past donors contributed to or the types of products
customers have purchased in the past, or lets a not-for-profit put a number on
statistics for a grant proposal, it can result in serious cash. Once a business
knows who the top donors are or what their customers want, they can
customize approaches and outreach.
2. Improve Branding and Marketing. Data can reveal a number of things like
what direction the marketing department should take. For example, there might
have been a recent customer survey asking about what services or products
consumers want to see. That kind of information is gold, and a marketing
department can do wonders with it. If a survey or any feedback is being
collected, put it to use.
in July can work on maximizing that month, while giving extra attention to
periods where sales slack.
Chapter 2
2.0 LITERATURE REVIEW
2.1.2 Classification
We can use classification to build up an idea of the type of customer, item, or object
by describing multiple attributes to identify a particular class. For example, we can
easily classify cars into different types (sedan, 4x4, convertible) by identifying
different attributes (number of seats, car shape, driven wheels). Given a new car, we
can assign it to a particular class by comparing its attributes with our known
definitions. We can apply the same principles to customers, for example by classifying
them by age and social group.
2.1.4 Prediction
Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in
combination with the other data mining techniques, prediction involves analyzing
trends, classification, pattern matching, and relation. By analyzing past events or
instances, we can make a prediction about an event. For example, in credit card
authorization, we can combine decision tree analysis of an individual's past
transactions with classification and historical pattern matching to identify whether
a transaction is fraudulent.
items be added to their shopping cart based on their frequency and past purchasing
history.
2.1.6 Clustering
By examining one or more attributes or classes, we can group individual pieces of
data together to form a structured opinion. Clustering uses one or more attributes as
the basis for identifying a cluster of correlating results. Clustering is useful for
identifying different information because it correlates with other examples, so we can
see where the similarities and ranges agree.
Clustering can work both ways. We can assume that there is a cluster at a certain point
and then use our identification criteria to see if we are correct. The graph in Figure
2.1 shows a good example. In this example, a sample of sales data compares the age
of the customer to the size of the sale. It is not unreasonable to expect that people in
their twenties (before marriage and kids), fifties, and sixties (when the children have
left home), have more disposable income.
Decision trees are often used with classification systems to attribute type information,
and with predictive systems, where different predictions might be based on past
historical experience that helps drive the structure of the decision tree and the output.
2.1.8 Decision Table
Decision tables, like decision trees, are classification models used for prediction. They
are induced by machine learning algorithms. A decision table consists of a hierarchical
table in which each entry in a higher level table gets broken down by the values of a
pair of additional attributes to form another table. The structure is similar to
dimensional stacking.
software, and usually Internet capabilities that help
The biggest benefit most businesses realize when moving to a CRM system comes
directly from having all the business data stored and accessed from a single location.
Before CRM systems, customer data was spread out over office productivity suite
documents, email systems, mobile phone data and even paper note cards and Rolodex
entries. Storing all the data from all departments (e.g., sales, marketing, customer
service and HR) in a central location gives management and employees immediate
access to the most recent data when they need it. Departments can collaborate with
ease, and CRM systems help organizations develop efficient automated processes to
improve business processes.
2.2.6 Data Mining and Customer Relationship Management [17]
The first task, identifying market segments, requires significant data about prospective
customers and their buying behaviours. In theory, the more data the better. In practice,
however, massive data stores often impede marketers, who struggle to sift through the
minutiae to find the nuggets of valuable information.
Recently, marketers have added a new class of software to their targeting arsenal.
Data mining applications automate the process of searching the mountains of data to
find patterns that are good predictors of purchasing behaviours.
After mining the data, marketers must feed the results into campaign management
software that, as the name implies, manages the campaign directed at the defined
market segments.
In the past, the link between data mining and campaign management software was
mostly manual. In the worst cases, it involved "sneaker net," creating a physical file
on tape or disk, which someone then carried to another computer and loaded into the
marketing database.
This separation of the data mining and campaign management software introduces
considerable inefficiency and opens the door for human errors. Tightly integrating the
two disciplines presents an opportunity for companies to gain competitive advantage.
Model building is the next phase of the Data mining tool, which builds the various
models according to the data given in the data preparation phase. The last phase is the
evaluation of the model, so that the proper results in the form of useful patterns can be
drawn from the models built by the tools.
The tools of data mining for CRM should be able to detect the necessary information
from the available data. To achieve this, data mining tools should have
characteristics such as:
Figure 2.3 Data Mining Applications Useful For Companies.
(Adapted from http://www.informationweek.com/673/73iudat.htm)
Figure 2.3 shows that customer demographics is one of the most important
applications for companies. Data mining tools are applied in the following areas:
Anticipate and prevent customer attrition: Data mining tools can help to find
customers who are not satisfied with the firm's services. This helps the firm to
offer promotional services to the group of customers who are likely to
attrite.
Mine unstructured data, such as text: Text data is always unstructured, so data
mining tools can help organizations mine it and extract value from
the data.
Banking/Finance
Retail Industry
Telecommunication Industry
Medical/Pharma
Intrusion Detection
Customer Retention.
2.3.5 MEDICAL/PHARMA
Data mining plays a very important part in the medical field. With data mining,
research into new cures for rare diseases is more likely to succeed. Below are the
aspects in which data mining contributes to the medical field:
Large data sets are being generated by fast numerical
simulations in various fields such as climate and ecosystem modelling, chemical
engineering, fluid dynamics etc. The following are applications of data mining in the
field of scientific applications:
Graph-based mining.
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Data Types - The data mining system may handle formatted text, record-based
data and relational data. The data could also be in ASCII text, relational database data
or data warehouse data. Therefore we should check exactly which formats the data mining
system can handle.
Data Sources - Data sources refers to the data formats on which the data mining
system will operate. Some data mining systems may work only on ASCII text files,
while others work on multiple relational sources. A data mining system should also
support ODBC connections or OLE DB for ODBC connections.
Data Mining functions and methodologies - Some data mining
systems provide only one data mining function, such as classification, while others
provide multiple data mining functions such as concept description, discovery-driven
OLAP analysis, association mining, linkage analysis, statistical analysis,
No coupling
Loose Coupling
Tight Coupling
as column scalable if the mining query execution time increases linearly with number
of columns.
Data Visualization
Data Mining query language and graphical user interface - A graphical
user interface that is easy to use is required to promote user-guided, interactive
data mining. Unlike relational database systems, data mining systems do not share an
underlying data mining query language.
Integration of data mining with database systems, data warehouse systems and
web database systems.
Web mining
Starting from the knowledge discovery processes used in early data mining projects,
CRISP-DM defined and validated a data mining process that is applicable in
any industry sector. This methodology should make large data mining projects faster,
cheaper, more reliable and more manageable. However, even small scale data mining
investigations can benefit from using it.
This process model provides a simple overview of the life cycle of a data mining
project. Corresponding phases of a data mining project are clearly identified
throughout tasks and relationships between these tasks. Even if the model doesn't
indicate it, relationships may exist between all data mining tasks, mainly
depending on the analysis goals and on the data to be analysed.
Six main phases can be distinguished in this process model:
Data understanding - this phase aims at getting a precise idea about data
available, identifying possible data quality issues, etc.
Data preparation - covers all activities meant to build the dataset to analyse
from the initial raw data. This includes cleaning, feature selection, sampling, etc.
Modeling - is the phase where several data mining techniques are parameterized
and tested with the objective of optimizing the obtained data model or knowledge.
Evaluation - aims at verifying that the obtained model properly answers the
initially formulated business objectives and contributes to deciding whether the model
will be deployed or, on the contrary, will be rebuilt.
Deployment - is the final step of the cyclic data mining process model. Its
target is to take the obtained knowledge, put it in a convenient form and integrate it in
the business decision process. Depending on the objectives, it can range from generating
a report describing the obtained knowledge to creating a specific application that will
use the obtained model to predict unknown values of a desired parameter.
Chapter 3
3.0 ANALYSIS
Due to high competition in the business field, it is essential to consider the customer
relationship management of the shopping centre. Here we analyse the massive volume of
customer data and classify the customers based on their behaviour and prediction.
Customer relationship management is mainly used in sales forecasting and banking
areas. Data mining provides the technology to analyse mass volumes of data and detect
hidden patterns in data to convert raw data into valuable information.
This work analyses DM techniques in Weka workbench, and reports the simulation
results of applying four DM techniques and classifiers in the open source workbench
to the Customer Relationship Management (CRM) for a shopping centre.
We propose that data mining techniques be used to aid the
salesperson and management of the shopping centre in effective decision making.
This approach was applied to 100 pre-processed records. Simulation results show that
a large volume of historical customer data can play a value-adding role in
shopping centre development: the mined data helps management to study
customer behaviour so that personalized services can be provided.
Our aim is to demonstrate the possibilities and draw attention to the possible
implications of improving customer satisfaction. The objectives of this work could
include increasing rental incomes and bringing new life back into the shopping centre.
Above is the sample data for testing purposes. It consists of 100 pre-processed
customer records. Included fields are:
Sex
Age
Channel
Transportation
The customer data may contain attributes that take much larger values than others. If
such attributes are left unnormalized, they need to be normalized before mining.
Furthermore, it would be useful for analysis to obtain aggregate information. Data
transformation operations, such as normalization and aggregation, are additional data
pre-processing procedures that contribute toward the success of the mining process.
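As an illustration of the normalization step, the following is a minimal sketch of
min-max scaling in Python. The function name and the sample ages are assumptions made
for this example only; they are not taken from the project data or from WEKA.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant attribute has no spread, so map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Hypothetical ages, for illustration only.
ages = [18, 25, 40, 66]
normalized_ages = min_max_normalize(ages)
```

After scaling, an attribute with large raw values (such as age) lies in the same range
as the other attributes, so it no longer dominates distance-based or error-based
measures.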
Correctly Classified
Incorrectly Classified
Kappa Statistic
We will show the results of the above evaluation criteria applied to two scenarios
based on the customer data records maintained by the shopping centre.
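Of the evaluation criteria listed, the Kappa statistic is the least self-explanatory:
it measures how much better a classifier agrees with the true classes than chance
agreement would. The sketch below computes it from a confusion matrix. The function
name and the example matrices are assumptions for illustration, not WEKA output.

```python
def kappa_statistic(confusion):
    """Cohen's kappa for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class confusion matrix.
example_kappa = kappa_statistic([[20, 5], [10, 15]])

# A classifier that always predicts the majority class of a 60/40 split.
majority_kappa = kappa_statistic([[60, 0], [40, 0]])
```

The second example shows why accuracy alone can mislead: always predicting the
majority class is 60% accurate on a 60/40 split, yet its kappa is 0 because that
accuracy is no better than chance agreement.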
3.1.3 Methods
Four DM algorithms were tested, as follows:
Decision Tree (J48): J48 attempts to account for noise and missing data. It
also deals with numeric attributes by determining where thresholds for
decision splits should be placed. The main parameters that can be set for this
algorithm are the confidence threshold, the minimum number of instances per
leaf and the number of folds for reduced error pruning.
Association: This technique finds groups of items that tend to occur together
in a transaction; that is, it searches for relationships between variables. For
example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes.
This is sometimes referred to as market basket analysis. We also identified and
performed an association rule mining task. This involves:
(1) Finding rules, including appropriate parameter setting,
(2) Determining which of the resulting rules are interesting,
(3) Figuring out how the interesting rules could be useful.
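The three steps above can be sketched in a few lines of Python. This is a minimal,
unoptimized illustration of Apriori-style level-wise search, not the WEKA
implementation or the modified code of this project. The baskets, function names and
the threshold values (minimum support 0.5, minimum confidence 0.8) are assumptions
chosen for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping frequent ones."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        level = {}
        for candidate in candidates:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        ]
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                confidence = itemset_support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.8)
```

With these baskets the only rule that passes both thresholds is butter => bread,
because every basket containing butter also contains bread. Lowering either threshold
produces more rules, which is why parameter setting and the judgement of which rules
are interesting go together.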
Naive Bayes: Fig. 3.4 shows the output of the Naive Bayes algorithm that is used to
analyze the data.
Fig. 3.4 shows the result of the analysis for transportation based on Naive
Bayes. The result reveals that both male and female customers prefer to use
private transport when travelling to the shopping centre.
Decision Table: Fig. 3.5 shows the output for the case study, which uses 100 training
instances and produces 1 rule; non-matches are covered by the majority class.
Decision Tree (J48): Fig. 3.6 shows the output produced by the J48 algorithm.
If Sex = female and transportation = public and age less than or equal to 66 then credit
card
If Sex = female and transportation = public and age greater than 66 then cash
If Sex = male then credit card
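The rules listed above can be written directly as a small function. This is only a
transcription of those rules for illustration; the function name is hypothetical, and
treating the predicted value as the payment method is an assumption based on the
values "credit card" and "cash".

```python
def predict_payment(sex, transportation, age):
    """Apply the J48 rules as transcribed from the text.

    Returns None for combinations that the listed rules do not cover.
    """
    if sex == "male":
        return "credit card"
    if sex == "female" and transportation == "public":
        return "credit card" if age <= 66 else "cash"
    return None


# Hypothetical customers, for illustration only.
first = predict_payment("female", "public", 30)
second = predict_payment("female", "public", 70)
third = predict_payment("male", "private", 45)
```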
Association: Fig. 3.7 shows the results of selecting the Apriori algorithm under
Associate Rules. The algorithm provides many rules, but only a few are useful for
effective decision making. It cannot generate the best rules because of insufficient data.
To make sure the Apriori algorithm of Associate Rules works well, some
new fields have been added to the sample data: relationship, region, brand and races.
Age has been removed because the Apriori algorithm does not support numeric data.
3.1.5 Comparison between Naive Bayes (NB), Decision Table (DT) and Decision
Tree (J48)
Table 3.1 shows the comparison results of Naive Bayes (NB), Decision Table (DT) and
J48. Overall, J48 gives better results than the DT and NB since J48 produces less
error.
Naive Bayes (NB)              Use Training Set   Cross Validation   Percentage Split
Correctly Classified          60                 58                 19
Incorrectly Classified        40                 42                 15
Kappa Statistic               0.0909             0.0455             0.0449
Mean Absolute Error           0.4562             0.4671             0.4831
Root Mean Squared Error       0.4777             0.4897             0.5114
Relative Absolute Error       94.97%             97.23%             99.37%
Root Relative Squared Error   97.52%             99.97%             102.28%

Decision Table (DT)           Use Training Set   Cross Validation   Percentage Split
Correctly Classified          60                 56                 19
Incorrectly Classified        40                 44                 15
Kappa Statistic               0                  -0.0577            0
Mean Absolute Error           0.4812             0.4855             0.4868
Root Mean Squared Error       0.4899             0.4963             0.4994
Relative Absolute Error       100.17%            101.05%            100.14%
Root Relative Squared Error   100.01%            101.31%            99.87%

Decision Tree (J48)           Use Training Set   Cross Validation   Percentage Split
Correctly Classified          60                 59                 19
Incorrectly Classified        40                 41                 15
Kappa Statistic               0                  -0.0199            0
Mean Absolute Error           0.48               0.4827             0.4857
Root Mean Squared Error       0.4899             0.4954             0.5004
Relative Absolute Error       99.9184%           100.4763%          99.9137%
Root Relative Squared Error   99.9992%           101.1204%          100.0864%
Algorithm: Naive Bayes, J48, Decision Table
Correctly classified instances: 58, 59, 56
Time taken to build (seconds): 0.03
BUILD A MODEL.
Chapter 4
4.0 DESIGN
7. Build models.
8. Deploy models
As shown in Figure 4.1, the data mining process is best considered as a set of
nested loops rather than a straight line. The steps do have an order, but it is
not necessary to finish one step completely before moving on to the following
step, and work on a later step may lead back to a previous step.
4.1.1 Step One: Translate the business problem into a data mining problem
The first step is to explore the available data and make a list of candidate business
problems. A well-defined business problem leads the data mining project to the proper
destination. Data mining goals for a particular project should be specific rather than
broad and general. This makes it easier to monitor progress in achieving them. Example
of a specific goal:
List products whose sales are at risk if we discontinue wine and beer sales.
Figure 4.5 Data from the past mimics data from the past, present, and future.
Variables such as address, post, telephone number and email are useful information, but
not all data mining algorithms can handle them. So we have to fix the data by replacing
them with other attributes.
The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.
Scoring is the process of applying a model to data to make predictions. The process of
using the model is different from the process that creates the model. A model is used
multiple times after it is created to score different databases. For example, it can be
used to predict the probability that a customer will purchase an item during the
wholesale.
In the end, it generates a prediction number between 0 and 1 as the output, also
known as the score.
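The separation between building a model once and scoring with it many times can be
sketched as follows. This is a deliberately simple illustration that uses the past
purchase rate per customer segment as the model; the segment names, the history and
the default score of 0.5 for unseen segments are all assumptions made for this
example, not part of the project.

```python
from collections import defaultdict


def build_model(history):
    """Build the model once: purchase rate per segment from (segment, purchased) pairs."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [purchases, total records]
    for segment, purchased in history:
        counts[segment][0] += int(purchased)
        counts[segment][1] += 1
    return {segment: bought / total for segment, (bought, total) in counts.items()}


def score(model, segments, default=0.5):
    """Score new customers: one number between 0 and 1 per customer."""
    return [model.get(segment, default) for segment in segments]


# Hypothetical past records, for illustration only.
history = [
    ("member", True),
    ("member", True),
    ("member", False),
    ("visitor", False),
]
model = build_model(history)

# The same model can be reused to score different customer lists.
scores_a = score(model, ["member", "visitor"])
scores_b = score(model, ["unknown"])
```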
Chapter 5
5.0 IMPLEMENTATION
A shopping centre wants to know about its sales for the past 5 months, so that it
can forecast and achieve its target sales for future months. Below are the specific
goals:
Above is a CSV file that contains 1000 user/customer profiles for testing purposes.
The data contain errors and inconsistencies, and some records lack attribute
values. A data cleaning procedure is needed before testing: filling in
the missing values, smoothing noisy data, identifying or removing outliers and
resolving inconsistencies in the data.
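One of the cleaning steps named above, filling in missing values, can be sketched as
follows for a nominal attribute. Replacing a missing value with the most frequent
value (the mode) is one common choice, not necessarily the one used in this project.
The marker "?" for a missing value and the sample records are assumptions for
illustration; only the field names userID and smoker come from the data set.

```python
from collections import Counter

# Values treated as missing; "?" is assumed here as the missing-value marker.
MISSING = ("", "?", None)


def fill_missing_with_mode(records, field):
    """Return new records where missing values of field are replaced by its mode."""
    present = [r[field] for r in records if r.get(field) not in MISSING]
    if not present:
        return list(records)
    mode = Counter(present).most_common(1)[0][0]
    return [
        {**r, field: mode} if r.get(field) in MISSING else r
        for r in records
    ]


# Hypothetical records, for illustration only.
profiles = [
    {"userID": "U1", "smoker": "false"},
    {"userID": "U2", "smoker": "?"},
    {"userID": "U3", "smoker": "false"},
    {"userID": "U4", "smoker": "true"},
]
cleaned = fill_missing_with_mode(profiles, "smoker")
```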
Included fields are:
userID
smoker
drink_level
dress_preference
ambience
transport
marital_status
hijos
birth_year
interest
personality
religion
activity
color
weight
budget
height
Upayment
Fcuisine
Chart: Amount of drinks by Month
A model set was created for prediction from the amount of drinks sold over the past 5
months in the data set. When making a prediction, the predictive model uses
data from the past, finding patterns to make predictions about the future. From the
model set, we found that alcoholic drinks had the highest sales during the 5-month
period. Thus, we should not discontinue beer sales. We can run promotions for
non_alcohol and juice during the 3rd and 4th months to boost their sales.
The figure above is the fixed dataset after the data cleaning process. Variables such
as address, post, telephone number and email are useful information, but not all of
them can be handled by the data mining algorithm of this project. So we have to choose
attributes that can be used in Associate Rules and fix the data by replacing the rest
with other attributes.
Comparing figure 5.7 with the previous figure 5.6, there are some changes to the
income attribute. Associate Rules cannot read numeric data, so we have to
convert the numeric data into nominal data: income is converted to low, medium or high
instead of using numbers as the income attribute values.
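The conversion from numeric to nominal values can be sketched as a simple binning
function. The labels low, medium and high come from the text; the function name and
the two cut-off values are hypothetical and would need to be chosen to suit the real
income distribution.

```python
def income_to_nominal(income, low_upper=2000, medium_upper=5000):
    """Map a numeric income to a nominal label using two assumed cut-offs."""
    if income < low_upper:
        return "low"
    if income < medium_upper:
        return "medium"
    return "high"


# Hypothetical incomes, for illustration only.
incomes = [1500, 3200, 8000]
labels = [income_to_nominal(value) for value in incomes]
```

Once every numeric attribute has been replaced by a small set of labels like these,
each label can be treated as an item, which is the form of input that association
rule mining expects.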
The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed. The data is filtered into 3 models.
To build a model, we can use parameters to adjust the algorithm and apply filters to the
dataset, creating different results. The mining model object contains summaries and
patterns that can be used for prediction. Below are the figures of the 3 models:
Model 1 (religion = non_muslim), drinks: alcohol 374, non_alcohol 155, juice 107
Model 2 (food_preference = non_halal), shopping_cart: beef 104, chicken 96, pork 84
Model 3 (1000 customers), payment: cash 618, credit card 349, debit card 33
Original
Number of association rules generated: 10. The total time is 47 ms.
Modified
Number of association rules generated: 10. The total time is 44 ms. The runtime of
the Apriori algorithm has been improved.
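To put the two reported run times in proportion, and to show one way such timings
could be collected, here is a minimal sketch. The helper names are hypothetical and
the timing function is not the method used in this project; it simply repeats a call
and keeps the best wall-clock time, which helps when the difference being measured is
only a few milliseconds.

```python
import time


def time_ms(func, *args, repeats=5):
    """Best wall-clock time of func(*args) in milliseconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def relative_improvement(original_ms, modified_ms):
    """Percentage reduction in run time relative to the original."""
    return (original_ms - modified_ms) / original_ms * 100.0


# Run times reported in the text: 47 ms (original) and 44 ms (modified).
improvement = relative_improvement(47, 44)

# Example of timing an arbitrary call (here the built-in sum, for illustration).
elapsed = time_ms(sum, [1, 2, 3])
```

For the reported figures the reduction is 3 ms out of 47 ms, which is about 6.4%.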
Chapter 6
6.0 CONCLUSION
Time limitation as several courses requirements were due at the same time.
REFERENCES
Online Research
1) Data Mining: What is Data Mining
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/
datamining.htm Date extracted: 24/6/2014
http://www.sv-europe.com/crisp-dm-methodology/ Date
extracted: 21/8/2014
24) Association Rules Apriori Algorithm
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571250083/licao_9.pdf Date
extracted: 29/9/2014
25) Data Mining Applications & Trends
http://www.tutorialspoint.com/data_mining/dm_applications_trends.htm Date
extracted: 10/7/2014
26) GitHub
https://github.com/jashmenn/apriori
31) SPMF
http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php
Date extracted: 21/1/2015
32) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 20/1/2015
33) All My Brain
http://allmybrain.com/2007/11/12/implementing-the-apriori-data-miningalgorithm-with-javascript/
Date extracted: 12/1/2015
34) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 18/1/2015
35) stackoverflow
http://stackoverflow.com/questions/17125742/creating-k-itemsets-from-2itemsets
Date extracted: 16/1/2015
36) compilr
https://compilr.com/soniaj/apriori/Project.java
Date extracted: 22/1/2015
37) Apache Ant - Tutorial
http://www.vogella.com/tutorials/ApacheAnt/article.html
Date extracted: 23/1/2015
38) Uregina
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/Apriori.java
Date extracted: 22/1/2015
Reference Book
1) Data Mining: Practical Machine Learning Tools and Techniques, Second Edition
by Ian H. Witten, Department of Computer Science, University of Waikato and
Eibe Frank, Department of Computer Science, University of Waikato.
APPENDIX
Project 1 Gantt Chart
Semester 2
Weeks: 21-Nov, 28-Nov, 5-Dec, 12-Dec, 19-Dec, 26-Dec, 2-Jan, 9-Jan, 16-Jan, 23-Jan, 30-Jan
No. Activities
1 In-depth research on the Apriori algorithm
2 Select appropriate data
3 Analyse and prepare dataset for simulation
4 Modify Apriori algorithm
5 Validate model
6 Documentation