Improving Association Rules

Final Year Project 2
Development of Data Mining Algorithms for Analysing Shopping Centre Dataset
by
Name: Khoo See Jun
ID: SN089817
Project Supervisor: Alicia Tang Yee Chong, Dr.
DECLARATION
I hereby declare that this report, submitted to University Tenaga Nasional as a
partial fulfilment of the requirements for the Bachelor of Computer Science (System
and Networking) has not been submitted as an exercise for a degree at any other
university. I also certify that the work described here is entirely my own except for
excerpts and summaries whose sources are appropriately cited in the references.
This report may be made available within the university library and may be
photocopied or loaned to other libraries for the purposes of consultation.
23 February 2015
KHOO SEE JUN
SN089817
APPROVAL SHEET
This thesis entitled:

Development of Data Mining algorithms for analysing public domain databases
Submitted by:
KHOO SEE JUN (SN089817)
In requirement for the degree of Bachelor of Computer Science (System and
Networking), College of Information Technology, University Tenaga Nasional has
been accepted.
Supervisor: Alicia Tang Yee Chong, Dr.
Signature: .
Date:
ABSTRACT
The objective of this project is to develop data mining algorithms for analysing
public domain databases. The public domain chosen for this project is the shopping
centre. My task is to identify and perform an association rule mining task, which
involves selecting an appropriate data set, preparing and preprocessing the data,
finding rules (including appropriate parameter setting), determining which of the
resulting rules are interesting and figuring out how the interesting rules could be
useful.
This work analyses well-known DM techniques in the Weka workbench and reports the
simulation results obtained on sample data by applying four selected DM techniques
and classifiers in the open source workbench to Customer Relationship Management
(CRM) in a shopping centre.
The design of the data mining process is presented in Chapter 4, which shows the
data mining workflow.
The next section is implementation. The implementation begins with the data mining
process (methodology), proceeds to building a model set for prediction, and then
compiles the modified code using Apache Ant so that the code can be used by WEKA.
Lastly, the best rules are generated by importing the dataset into the WEKA software,
and the run time of the original code is compared with that of the modified code in
separate WEKA builds. The results show that the run time is improved with the
modified code.
TABLE OF CONTENTS
DECLARATION
APPROVAL SHEET
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
1.0 INTRODUCTION
1.1 Project Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Benefits
1.5 Project Scope
1.6 Expected Outcome
1.7 Gantt Chart
CHAPTER 2: RESEARCH AND LITERATURE REVIEW
2.0 LITERATURE REVIEW
2.1 Data Mining Techniques
2.1.1 Association
2.1.2 Classification
2.1.3 Prediction
2.1.4 Sequential Patterns (Long-term data)
2.1.5 Clustering
2.1.6 Decision Trees (J48)
2.1.7 Decision Table
2.2 Customer Relationship Management (CRM)
2.2.1 What is Customer Relationship Management (CRM)?
2.2.2 How CRM is Used Today
2.2.3 The CRM Strategy
2.2.4 The Impact of Technology on CRM
2.2.5 The Benefits of CRM
2.2.6 Data Mining and Customer Relationship Management
2.2.7 Review of Data Mining Tools in CRM
2.2.8 Data Mining Tools Applications in CRM
2.3 Data Mining Applications
2.3.1 Banking/Finance (Financial Data Analysis)
2.3.2 Retail/Marketing Industry
2.3.3 Telecommunication Industry
2.3.4 Biological Data Analysis
2.3.5 Medical/Pharma
2.3.6 Insurance and Health Care
2.3.7 Other Scientific Applications
2.3.8 Intrusion Detection
2.4 Data Mining Systems
2.4.1 Data Mining System Classification
2.4.2 Data Mining System Products
2.4.3 Choosing Data Mining System
2.4.4 Trends in Data Mining
2.5 Data Mining Process Model
2.5.1 Overview of Data Mining Life Cycle
CHAPTER 3: ANALYSIS
3.0 ANALYSIS
3.1 Data Mining for Shopping Centers
3.1.1 Free Sample Data for Testing Purpose
3.1.2 Related Work
3.1.3 Methods
3.1.4 Result and Discussion
3.1.5 Comparison between Naive Bayes (NB), Decision Table (DT) and Decision Tree (J48)
3.1.6 Comparison between classifiers with time taken to build a model
3.2 Association Rules Apriori Algorithm
3.2.1 Apriori Algorithm
3.2.2 Limitations of Apriori Algorithm
CHAPTER 4: DESIGN
4.0 DESIGN
4.1 Data Mining Process
4.1.1 Step One: Translate the business into a data mining problem
4.1.2 Step Two: Select appropriate data
4.1.3 Step Three: Analyze the data
4.1.4 Step Four: Create a Model Set for Prediction
4.1.5 Step Five: Fix Problem with the Data
4.1.6 Step Six: Transform Data to Bring Information to the Surface
4.1.7 Step Seven: Build Models
4.1.8 Step Eight: Deploy Models
CHAPTER 5: IMPLEMENTATION
5.0 IMPLEMENTATION
5.1.1 Translate the business into a data mining problem
5.1.2 Select appropriate data
5.1.3 Analyze the data
5.1.4 Create a Model Set for Prediction
5.1.5 Fix Problem with the Data
5.1.6 Transform Data to Bring Information to the Surface
5.1.7 Build Models
5.1.8 Deploy Models
5.2 Apriori Algorithm Source Code
5.3 Import dataset into WEKA
CHAPTER 6: CONCLUSION
6.0 CONCLUSION
6.1 Progress and Outcome
6.2 Problems Encountered
6.3 Future Planning
REFERENCES
APPENDIX
LIST OF TABLES
TABLE 3.1
TABLE 3.2
LIST OF FIGURES
Figure 2.1 Clustering (Sample Diagram)
Figure 2.2 Decision Tree (J48)
Figure 2.3 Data Mining Applications Useful For Companies
Figure 2.4 Data Mining System Classification
Figure 2.5 Data Mining Process Model
Figure 3.1 Sample Data (CSV format)
Figure 3.2 Sample Data (Notepad format)
Figure 3.3 Block Diagram
Figure 3.4 Results returned by the Naive Bayes classifier
Figure 3.5 The decision table of data analysis
Figure 3.6 J48 pruned tree of sex analysis
Figure 3.7 Associate Rules
Figure 3.8 Sample Data (CSV format)
Figure 3.9 Sample Data (Notepad format)
Figure 4.1 Data Mining is not a linear process
Figure 4.2 Sample data in ARFF Viewer
Figure 4.3 Data Visualize
Figure 4.4 Visualization of data by age and sex
Figure 4.5 Data from the past mimics data from the past, present, and future
Figure 4.6 Sample data
Figure 4.7 Data Mining Model
Figure 4.8 Data Mining Process Model
Figure 4.9 Data Mining Scoring Process Model
Figure 4.10 Scoring Prediction
Figure 5.1 Appropriate data
Figure 5.2 Analyse data in ARFF Viewer
Figure 5.3 Data Visualize
Figure 5.4 Visualization of data by smoker and drink_level
Figure 5.5 Prediction Model
Figure 5.6 Fixed dataset
Figure 5.7 Transformed data
Figure 5.8 Data Mining Model
Figure 5.9 Model 1
Figure 5.10 Model 2
Figure 5.11 Model 3
Figure 5.12 Original Code Part 1
Figure 5.13 Modified Code Part 1
Figure 5.16 Result of original code
Figure 5.17 Result of modified code
CHAPTER 1
1.0 INTRODUCTION
1.1 Project Introduction

Far too many companies sit on large volumes of good customer data and do nothing
with it. They do not realise that this data is a gold mine of insight that can increase
customer loyalty, unlock hidden profitability and reduce client churn. Data mining
(knowledge discovery) is the process companies use to turn raw data into useful
information: computer-assisted software digs through and analyses enormous sets of
data to extract hidden predictive information from large databases. It is a powerful
new technology with great potential to help companies focus on the most important
information in their data warehouses. With it, businesses can learn more about their
customers and develop more effective marketing strategies, as well as increase sales
and decrease costs.
Grocery stores are well-known users of data mining techniques. Many supermarkets
offer free loyalty cards to customers that give them access to reduced prices not
available to non-members. The cards make it easy for stores to track who is buying
what, when they are buying it, and at what price. The stores can then use this data,
after analyzing it, for multiple purposes, such as offering customers coupons that are
targeted to their buying habits and deciding when to put items on sale and when to sell
them at full price. Data mining tools predict behaviours and future trends, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may
miss because it lies outside their expectations.
1.2 Problem Statement

Most companies waste large amounts of useful customer data by doing nothing with
it. They do not know exactly what their customers need or what they are missing. By
engaging in data mining, a company can gain greater insight into external conditions,
internal processes, its market and its customers. It also gains predictive capabilities
that can be used both in strategic planning and in daily interactions. These insights and
predictive capabilities take a company's business results to the next level by
improving its marketing campaign management, up-sell and cross-sell activities,
customer retention, risk analysis and fraud detection efforts. This project aims to help
a company create powerful strategies, make fast and feasible decisions and achieve
competitive advantage in the future.
1.3 Objectives
The objectives of this project are:
1. To identify parameters of the algorithm
2. To design new data mining algorithms
3. To develop new data mining algorithms
1.4 Benefits
The benefits of data mining in businesses are:
1. More Money. Money is always a good thing in business. Mined data can unearth
the kinds of projects past donors contributed to or the types of products
customers have purchased in the past, and it lets a not-for-profit put a number
on statistics for a grant proposal; all of this can result in serious cash. Once a
business knows who its top donors are or what its customers want, it can
customize its approaches and outreach.
2. Improve Branding and Marketing. Data can reveal a number of things like
what direction the marketing department should take. For example, there might
have been a recent customer survey asking about what services or products
consumers want to see. That kind of information is gold, and a marketing
department can do wonders with it. If a survey or any feedback is being
collected, put it to use.
3. Streamline Outreach. Whether a business depends on e-mail blasts, print ads
or social media, knowing how customers want to be approached is important.
Data that includes relevant e-mail addresses, mailing addresses or social media
pages can help streamline any mailers or outreach. It also saves money,
whether it's in postage or time, by keeping consumer information updated.
4. Tap into New Markets. There are some databases available that businesses
can purchase, or the databases might be available to the public free of charge.
Business owners can use the databases of others to find out more information
about potential consumers and identify any holes in the current tactics.
However, when handling outside databases, it's especially important to
practice caution. Privacy is a big legal issue, and sometimes it's easy to overstep
boundaries.
5. Share and Share Alike. Sharing customer information is often legally
restricted, and what is allowed depends on what the customer has signed. For
example, some coalitions may share information on consumers in order to
provide better services. This can be dangerous ground, but where it is legally
acceptable, some business owners can access the data of partner organizations,
too. This greatly expands the availability of information and can provide more
data (and likely, in turn, more accurate data) to improve the bottom line,
services and research.
6. Learn from the Past. Data mining past information and comparing it to the
current situation can reveal a lot. Graphs can easily show any troubling sales
years, spikes or other trends that should be taken into consideration. Seeing the
ebb and flow of a business via data can provide insight that otherwise might be
overlooked. For example, a business that knows there's a history of high sales
in July can work on maximizing that month, while giving extra attention to
periods where sales slacken.
1.5 Project Scope

Given databases of sufficient size and quality, data mining technology can generate
new business opportunities. The scope of this project covers the following tasks:
1. Selecting an appropriate data set
2. Preparing and pre-processing the data
3. Finding rules and identifying parameters for the algorithm
1.6 Expected Outcome
The expected outcomes of the project are:
1. A critical review of data mining techniques
2. A dataset from a company
3. The design of a new data mining algorithm
1.7 Gantt Chart
CHAPTER 2
2.0 LITERATURE REVIEW
2.1 Data Mining Techniques [6], [10], [11]

2.1.1 Association
Association (or relation) is probably the best known, most familiar and most
straightforward data mining technique. It makes a simple correlation between two or
more items, often of the same type, to identify patterns. For example, when tracking
people's buying habits, we might identify that a customer always buys cream when
they buy strawberries, and therefore suggest that the next time they buy strawberries
they might also want to buy cream.
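The strength of a rule such as "strawberries imply cream" is usually expressed with two measures: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The following is a minimal sketch of both measures; the transactions are invented for illustration and the function names are hypothetical, not part of WEKA.

```python
# Toy transactions; the items and baskets are invented for illustration.
transactions = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "sugar"},
    {"strawberries", "bread"},
    {"bread", "milk"},
    {"cream", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"strawberries", "cream"}, transactions)
rule_confidence = confidence({"strawberries"}, {"cream"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data the two items appear together in two of the five baskets, and cream appears in two of the three baskets that contain strawberries.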
2.1.2 Classification
We can use classification to build up an idea of the type of customer, item, or object
by describing multiple attributes to identify a particular class. For example, we can
easily classify cars into different types (sedan, 4x4, convertible) by identifying
different attributes (number of seats, car shape, driven wheels). Given a new car, we
can assign it to a particular class by comparing its attributes with our known
definitions. We can apply the same principles to customers, for example by classifying
them by age and social group.
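The car example can be sketched as a comparison of a new record against known class definitions. This is only an illustration: the attribute values in `CAR_CLASSES` are assumptions made up for the example, and a real classifier would learn its definitions from labelled data rather than have them written by hand.

```python
# Hypothetical class definitions: seats, body shape and driven wheels.
CAR_CLASSES = {
    "sedan": {"seats": 5, "shape": "saloon", "driven_wheels": 2},
    "4x4": {"seats": 5, "shape": "boxy", "driven_wheels": 4},
    "convertible": {"seats": 2, "shape": "open-top", "driven_wheels": 2},
}


def classify(car):
    """Return the class whose definition shares the most attribute values with car."""

    def matches(definition):
        return sum(car.get(key) == value for key, value in definition.items())

    return max(CAR_CLASSES, key=lambda name: matches(CAR_CLASSES[name]))


new_car = {"seats": 5, "shape": "boxy", "driven_wheels": 4}
predicted = classify(new_car)
print(predicted)
```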
2.1.4 Prediction
Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in
combination with the other data mining techniques, prediction involves analyzing
trends, classification, pattern matching, and relation. By analyzing past events or
instances, we can make a prediction about an event. For example, in credit card
authorization, we can combine decision tree analysis of an individual's past
transactions with classification to identify whether a transaction is fraudulent by
matching it against the historical pattern of the individual.
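As a much simpler stand-in for the combined decision tree and classification approach described above, the sketch below flags a transaction whose amount deviates strongly from the cardholder's own history. The threshold of three standard deviations, the function name and the sample amounts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag amount if it lies more than `threshold` standard deviations
    away from the mean of the cardholder's past amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical: any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Invented past transaction amounts for one cardholder.
past_amounts = [42.0, 38.5, 51.0, 45.2, 40.3, 47.8]
flagged = is_suspicious(past_amounts, 950.0)
print(flagged)
```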
2.1.5 Sequential patterns (Long-term data)

Sequential patterns are a useful method for identifying trends, or regular occurrences
of similar events. For example, with customer data we can identify that customers buy
a particular collection of products together at different times of the year. In an online
shopping website, we can use this information to automatically suggest that certain
items be added to a customer's shopping cart based on their frequency and the
customer's past purchasing history.
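One simple way to surface such patterns is to count how many customers bought one item directly after another. The sketch below assumes each customer's purchases are already ordered by time; the purchase histories and the `min_count` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories, ordered by time, one list per customer.
histories = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag", "torch"],
    ["stove", "tent", "sleeping bag"],
    ["torch", "stove"],
]


def frequent_followups(sequences, min_count=2):
    """Count the customers for whom item b was bought directly after item a."""
    counts = Counter()
    for seq in sequences:
        # Use a set so that each customer contributes at most once per pair.
        counts.update(set(zip(seq, seq[1:])))
    return {pair: n for pair, n in counts.items() if n >= min_count}


patterns = frequent_followups(histories)
print(patterns)
```

A shop could use the surviving pairs to suggest the second item once the first one is in the cart.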
2.1.6 Clustering
By examining one or more attributes or classes, we can group individual pieces of
data together to form a structured view. Clustering uses one or more attributes as the
basis for identifying a cluster of correlating results. Clustering is useful for identifying
different information because it correlates with other examples, so we can see where
the similarities and ranges agree.
Clustering can work both ways. We can assume that there is a cluster at a certain point
and then use our identification criteria to see if we are correct. The graph in Figure
2.1 shows a good example. In this example, a sample of sales data compares the age
of the customer to the size of the sale. It is not unreasonable to expect that people in
their twenties (before marriage and kids), fifties, and sixties (when the children have
left home), have more disposable income.
Figure 2.1 Clustering (Sample Diagram).

(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)
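The idea of assuming a cluster at a certain point and then checking it can be sketched with a plain k-means loop on a single attribute. The ages and the two starting centres are invented for illustration; this is a generic k-means sketch, not the clustering implementation used by WEKA.

```python
def kmeans_1d(values, centres, iterations=10):
    """Plain k-means on one attribute, starting from the given centres."""
    centres = list(centres)
    groups = [[] for _ in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            # Assign each value to the nearest current centre.
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Invented ages of customers with large purchases.
ages = [22, 24, 25, 27, 52, 55, 58, 61, 64]
centres, groups = kmeans_1d(ages, centres=[20, 60])
print(centres, groups)
```

Starting from guesses of 20 and 60, the centres settle on the mean age of each group, which confirms or corrects the initial assumption about where the clusters lie.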
2.1.7 Decision Trees (J48)

Related to most of the other techniques (primarily classification and prediction), the
decision tree can be used either as a part of the selection criteria, or to support the use
and selection of specific data within the overall structure. Within the decision tree, we
start with a simple question that has two (or sometimes more) answers. Each answer
leads to a further question to help classify or identify the data so that it can be
categorized, or so that a prediction can be made based on each answer.
Figure 2.2 shows an example where you can classify an incoming error condition.
Figure 2.2 Decision Tree (J48).

(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)
Decision trees are often used with classification systems to attribute type information,
and with predictive systems, where different predictions might be based on past
historical experience that helps drive the structure of the decision tree and the output.
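The question-and-answer structure of a decision tree can be sketched as nested dictionaries that are walked from the root to a leaf. The tree below is written by hand with made-up questions and outcomes; J48 in WEKA instead learns its tree from training data.

```python
# A hand-written tree: inner nodes ask a question, leaves hold a label.
TREE = {
    "question": "is_member",
    "yes": {
        "question": "spent_over_100",
        "yes": "send premium coupon",
        "no": "send standard coupon",
    },
    "no": "offer loyalty card",
}


def decide(node, record):
    """Walk from the root to a leaf, following the answer to each question."""
    while isinstance(node, dict):
        node = node["yes"] if record[node["question"]] else node["no"]
    return node


action = decide(TREE, {"is_member": True, "spent_over_100": False})
print(action)
```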
2.1.8 Decision Table
Decision tables, like decision trees, are classification models used for prediction. They
are induced by machine learning algorithms. A decision table consists of a hierarchical
table in which each entry in a higher level table gets broken down by the values of a
pair of additional attributes to form another table. The structure is similar to
dimensional stacking.
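A flat, single-level version of this idea can be sketched as a lookup table that maps each combination of attribute values to the majority class seen in the training rows. The rows, attribute names and the `build_decision_table` helper are hypothetical, and the sketch omits the hierarchical breakdown and attribute selection that a full decision table learner performs.

```python
from collections import Counter, defaultdict


def build_decision_table(rows, attributes, target):
    """Map each combination of attribute values to its majority class."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        votes[key][row[target]] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Invented training rows.
rows = [
    {"age_group": "20s", "member": "yes", "buys": "yes"},
    {"age_group": "20s", "member": "yes", "buys": "yes"},
    {"age_group": "20s", "member": "yes", "buys": "no"},
    {"age_group": "50s", "member": "no", "buys": "no"},
]
table = build_decision_table(rows, ["age_group", "member"], "buys")
print(table)
```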
2.2 Customer Relationship Management (CRM) [15], [16], [17], [18]

2.2.1 What is Customer Relationship Management (CRM)?
CRM (customer relationship management) is an information industry term for
methodologies, software, and usually Internet capabilities that help a company
manage customer relationships in an organized way. For example, a company might
build a database about its customers that describes relationships in sufficient detail so
that management, salespeople, people providing service, and perhaps the customers
themselves can access information, match customer needs with product plans and
offerings, remind customers of service requirements, know what other products a
customer has purchased, and so on.
2.2.2 How CRM is Used Today

CRM solutions give a company the customer business data it needs to provide the
services or products that customers want, provide better customer service, cross-sell
and up-sell more effectively, close deals, retain current customers and better
understand the customer.
2.2.3 The CRM Strategy

Customer relationship management is often thought of as a business strategy that
enables businesses to improve in a number of areas. The CRM strategy allows a
company to do the following:
1) Understand the customer
2) Retain customers through better customer experience
3) Attract new customers
4) Win new clients and contracts
5) Increase profitability
6) Decrease customer management costs
2.2.4 The Impact of Technology on CRM

Technology and the Internet have changed the way companies approach customer
relationship strategies. Advances in technology have changed consumer buying
behaviour, and today there are many ways for companies to communicate with
customers and to collect data about them. With each new advance in technology,
especially the proliferation of self-service channels like the Web and smartphones,
customer relationships are being managed electronically.
Many aspects of customer relationship management rely heavily on technology;
however, the strategies and processes of a good CRM system will collect, manage and
link information about the customer with the goal of letting you market and sell
services effectively.
2.2.5 The Benefits of CRM
The biggest benefit most businesses realize when moving to a CRM system comes
directly from having all the business data stored and accessed from a single location.
Before CRM systems, customer data was spread out over office productivity suite
documents, email systems, mobile phone data and even paper note cards and Rolodex
entries. Storing all the data from all departments (e.g., sales, marketing, customer
service and HR) in a central location gives management and employees immediate
access to the most recent data when they need it. Departments can collaborate with
ease, and CRM systems help organizations develop efficient automated processes to
improve business processes.
2.2.6 Data Mining and Customer Relationship Management [17]
Customer relationship management (CRM) is a process that manages the interactions
between a company and its customers. The primary users of CRM software
applications are database marketers who are looking to automate the process of
interacting with customers.
To be successful, database marketers must first identify market segments containing
customers or prospects with high-profit potential. They then build and execute
campaigns that favourably impact the behaviour of these individuals.
The first task, identifying market segments, requires significant data about prospective
customers and their buying behaviours. In theory, the more data the better. In practice,
however, massive data stores often impede marketers, who struggle to sift through the
minutiae to find the nuggets of valuable information.
Recently, marketers have added a new class of software to their targeting arsenal.
Data mining applications automate the process of searching the mountains of data to
find patterns that are good predictors of purchasing behaviours.
After mining the data, marketers must feed the results into campaign management
software that, as the name implies, manages the campaign directed at the defined
market segments.
In the past, the link between data mining and campaign management software was
mostly manual. In the worst cases, it involved "sneaker net," creating a physical file
on tape or disk, which someone then carried to another computer and loaded into the
marketing database.
This separation of the data mining and campaign management software introduces
considerable inefficiency and opens the door for human errors. Tightly integrating the
two disciplines presents an opportunity for companies to gain competitive advantage.
2.2.7 Review of Data Mining Tools in CRM [18]
Data mining uses a combination of an explicit knowledge base, sophisticated
analytical skills, and domain knowledge to uncover hidden trends and patterns. These
trends and patterns form the basis of predictive models that enable analysts to produce
new observations from existing data. There are a number of data mining tools
available in the market that can give firms the cutting edge needed to achieve
profitable CRM.
Data mining tools help CRM by providing a complete framework, which covers:
Analysing the business problem.
Preparing the data requirements.
Building a suitable model with respect to the business problem.
Validating and evaluating the designed model.
Model building is the next phase of the Data mining tool, which builds the various
models according to the data given in the data preparation phase. The last phase is the
evaluation of the model, so that the proper results in the form of useful patterns can be
drawn from the models built by the tools.
Data mining tools for CRM should be able to detect the necessary information
from the available data. To achieve this, data mining tools should have
characteristics such as:
User friendly environment
Efficiency of the tool
Basic task should be accomplished
Low cost of implementation
2.2.8 Data Mining Tools Applications in CRM [18]

Virtually any process from pharmacology to customer service can be studied,
understood, and improved using data mining. The top three end uses of data mining
are, not surprisingly, in the marketing area.
Figure 2.3 Data Mining Applications Useful For Companies.
(Adopted from http://www.informationweek.com/673/73iudat.htm)
Figure 2.3 shows that customer demographics is one of the most important
applications for companies. Data mining tools are applied in:
Customer Profiling: In customer profiling, characteristics of good customers
are identified with the goals of predicting who will become one and helping
marketers target new prospects. Data mining can find patterns in a customer
database that can be applied to a prospective database so that customer
acquisition can be appropriately targeted. For example, by identifying good
candidates for mail offers or catalogues, direct-mail marketers can reduce
expenses and increase their sales.
Targeted Marketing: Targeting specific promotions to existing and potential
customers offers similar benefits.
Market-basket analysis: Market-basket analysis helps retailers understand
which products are purchased together or by an individual over time. With data
mining, retailers can determine which products to stock in which stores, and
even how to place them within a store. Data mining can also help assess the
effectiveness of promotions and coupons.
Manage customer relationship: Another common use of data mining in many
organizations is to help manage customer relationships. By determining
characteristics of customers who are likely to leave for a competitor, a
company can take action to retain that customer because doing so is usually far
less expensive than acquiring a new customer.
Fraud detection: Fraud detection is of great interest to telecommunications
firms, credit-card companies, insurance companies, stock exchanges, and
government agencies. Government agencies can also use data mining to identify
and track individual terrorists, such as through travel and immigration records.
Anticipate and prevent customer attrition: The data mining tool can help to
find the customers who are not satisfied with the firm's services. This helps the
firm to offer promotional services to the group of customers who are likely to
leave.
Mine unstructured data, such as text: Text data is always unstructured, so
data mining tools can help organizations mine this unstructured data and get
value out of it.
2.3 Data Mining Applications

Data mining is a data analysis approach that has been quickly adapted and used in a
large number of domains that were already using statistics. Here is the list of areas
where data mining is widely used:
Banking/Finance
Retail Industry
Telecommunication Industry
Biological Data Analysis
Medical/Pharma
Insurance and Health Care
Other Scientific Applications
Intrusion Detection
2.3.1 BANKING/FINANCE (FINANCIAL DATA ANALYSIS)

The financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining. Here are a
few typical cases:
Design and construction of data warehouses for multidimensional data analysis
and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Detection of fraudulent credit card usage patterns.
Risk management related to attribution of loans using scorecards.
Find hidden correlations between different financial indicators.
Identification of stocks trading rules from historical market data.
2.3.2 RETAIL/MARKETING INDUSTRY

Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends. That leads to improved quality of customer service and good customer
retention and satisfaction. Here is a list of examples of data mining in the retail industry:
Design and Construction of data warehouses based on benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Discovery of buying behaviour patterns
Detection of associations among customer characteristics.
Prediction of the probability that clients answer to mailing.
2.3.3 TELECOMMUNICATION INDUSTRY

Today the telecommunication industry is one of the fastest-emerging industries,
providing various services such as fax, pager, cellular phone, Internet messenger,
images, e-mail, web data transmission etc. Due to the development of new computer
and communication technologies, the telecommunication industry is rapidly
expanding. This is the reason why data mining has become very important in helping
to understand the business. Data mining in the telecommunication industry helps in
identifying telecommunication patterns, catching fraudulent activities, making better
use of resources, and improving quality of service. Here is a list of examples in which
data mining improves telecommunication services:
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
2.3.4 BIOLOGICAL DATA ANALYSIS

Nowadays there is vast growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is
a very important part of bioinformatics. Following are the aspects in which data mining
contributes to biological data analysis:
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
2.3.5 MEDICAL/PHARMA
Data mining plays a very important part in the medical field. With data mining,
research into new cures for rare diseases can achieve a higher success rate. Below are
the aspects in which data mining contributes to the medical field:
Computer Assisted Diagnosis (expert systems learning)
Characterization/prediction of patient's response to product dosage
Identification of successful medical therapies (successful prescription
patterns).
Study of relations between dosage and potentially related adverse events
2.3.6 INSURANCE AND HEALTH CARE

Following is how insurance companies manage their businesses and customers with
the help of data mining:
Discovery of medical procedures that are claimed together through claims
analysis
Identification of customers that are potential buyers for new policies.
Detection of behaviour patterns capable of identifying risky customers.
Detection of fraudulent behaviour.
2.3.7 OTHER SCIENTIFIC APPLICATIONS

The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. In contrast, huge amounts of
data have been collected from scientific domains such as geosciences and astronomy.
Large data sets are also being generated by fast numerical simulations in fields such
as climate and ecosystem modelling, chemical engineering and fluid dynamics.
Following are the applications of data mining in the field of scientific applications:
Data Warehouses and data pre-processing.
Graph-based mining.
Visualization and domain specific knowledge.
2.3.8 INTRUSION DETECTION

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become
a major issue. The increased usage of the internet and the availability of tools and
tricks for intruding on and attacking networks have prompted intrusion detection to
become a critical component of network administration. Here is a list of areas in which
data mining technology may be applied for intrusion detection:
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build

discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
2.4 Data Mining Systems [13]

A large variety of data mining systems is available. A data mining system may integrate techniques from the following:
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
2.4.1 Data Mining System Classification [12]
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Figure 2.4 Data Mining System Classification.

(Adopted from http://www.tutorialspoint.com/data_mining/dm_systems.htm)
2.4.2 Data Mining System Products [13]

Many data mining system products and domain-specific data mining applications are available. New data mining systems and applications are continually being added to the existing ones, and efforts are being made towards the standardization of data mining languages.
2.4.3 Choosing Data Mining System
The choice of a data mining system depends on the following features:
Data Types - The data mining system may handle formatted text, record-based data and relational data. The data could also be in ASCII text, relational database data or data warehouse data. Therefore, we should check which exact formats the data mining system can handle.
System Issues - We must consider the compatibility of the data mining system with different operating systems. A data mining system may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow XML data as input.
Data Sources - Data sources refers to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. A data mining system should also support ODBC connections or OLE DB for ODBC connections.
Data Mining functions and methodologies - Some data mining systems provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis and similarity search.
Coupling data mining with databases or data warehouse systems - A data mining system needs to be coupled with a database or data warehouse system. The coupled components are integrated into a uniform information processing environment. The types of coupling are listed below:
No coupling
Loose Coupling
Semi tight Coupling
Tight Coupling
Scalability - There are two scalability issues in data mining:
Row (Database size) Scalability - A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the query.
Column (Dimension) Scalability - A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
Visualization Tools - Visualization in data mining can be categorized as follows:
Data Visualization
Mining Results Visualization
Mining process visualization
Visual data mining
Data Mining query language and graphical user interface - An easy-to-use graphical user interface is required to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share an underlying data mining query language.
2.4.4 Trends in Data Mining [25]

The following trends in data mining reflect the pursuit of challenges such as the construction of integrated and interactive data mining environments and the design of data mining languages:
Application Exploration
Scalable and Interactive data mining methods
Integration of data mining with database systems, data warehouse systems and
web database systems.
Standardization of data mining query language
Visual Data Mining
New methods for mining complex types of data
Biological data mining
Data mining and software engineering
Web mining
Distributed Data mining
Real time data mining
Multi Database data mining
Privacy protection and Information Security in data mining
2.5 Data Mining Process Model [23]

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM methodology provides a structured approach to planning a data mining project. It is a robust and well-proven methodology.
2.5.1 Overview of Data Mining Life Cycle
Figure 2.5 Data Mining Process Model.

(Adopted from http://www.rithme.eu/?m=resources&p=dmmethod&lang=en)
Starting from the knowledge discovery processes used in early data mining projects, CRISP-DM defined and validated a data mining process that is applicable in any industry sector. This methodology aims to make large data mining projects faster, cheaper, more reliable and more manageable. However, even small-scale data mining investigations can benefit from using it.
This process model provides a simple overview of the life cycle of a data mining project. The phases of a data mining project are clearly identified, along with their tasks and the relationships between these tasks. Even where the model does not indicate it, relationships may exist between all data mining tasks, depending mainly on the analysis goals and on the data to be analysed.
Six main phases can be distinguished in this process model:
Business understanding - concerns the definition of the data mining problem based on the business objectives.
Data understanding - aims at getting a precise idea of the available data, identifying possible data quality issues, etc.
Data preparation - covers all activities meant to build the dataset to be analysed from the initial raw data. This includes cleaning, feature selection, sampling, etc.
Modeling - is the phase in which several data mining techniques are parameterized and tested with the objective of optimizing the obtained data model or knowledge.
Evaluation - aims at verifying that the obtained model properly answers the initially formulated business objectives, and contributes to deciding whether the model will be deployed or, on the contrary, rebuilt.
Deployment - is the final step of the cyclic data mining process model. Its target is to take the obtained knowledge, put it in a convenient form and integrate it into the business decision process. Depending on the objectives, it can range from generating a report describing the obtained knowledge to creating a specific application that uses the obtained model to predict unknown values of a desired parameter.
Chapter 3
3.0 ANALYSIS
3.1 Data Mining for Shopping Centres

With the majority of large retailers offering a loyalty card scheme, the collection of
customer data is now routine commercial practice. Whilst loyalty schemes were
originally introduced to reward loyal customers and to encourage them to increase
their overall spend, retailers have been finding more and more sophisticated ways to
use customer data to their advantage.
Due to high competition in the business field, it is essential to consider the customer relationship management of the shopping centre. Here we analyse the massive volume of customer data and classify customers based on their behaviour and on predictions. Customer relationship management is mainly used in sales forecasting and banking. Data mining provides the technology to analyse large volumes of data and detect hidden patterns, converting raw data into valuable information.
This work analyses DM techniques in Weka workbench, and reports the simulation
results of applying four DM techniques and classifiers in the open source workbench
to the Customer Relationship Management (CRM) for a shopping centre.
We propose that data mining techniques be used to aid the salespeople and management of the shopping centre in effective decision making. This approach was applied to 100 pre-processed records. Simulation results show that a large volume of historical customer data can play a value-added role in shopping centre development, because the mined data helps management to study customer behaviour so that personalized services can be provided.
Our aim is to demonstrate the possibilities and draw attention to the possible implications for improving customer satisfaction. The objectives of this work could include increasing rental incomes and bringing new life back into the shopping centre.
3.1.1 Free Sample Data for Testing Purpose
Figure 3.1 Sample Data (CSV format).
Above is the sample data used for testing. It consists of 100 pre-processed customer records. The included fields are:
Sex
Age
Channel
Transportation
All files are provided as CSV (comma-delimited).

Sex is the customer's gender, and the Age values are random. Channel is the way the customer makes payment, either by credit card or by cash. Transportation is how the customer travels to their destination.
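As an illustration of the record layout described above, the following is a minimal sketch in Python of how such a comma-delimited file could be read and summarised. The four rows, the lowercase value spellings and the helper names load_records and count_values are assumptions made for this example; they are not taken from the project's actual data file.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the layout described above (Sex, Age, Channel,
# Transportation); the values are invented for illustration only.
SAMPLE_CSV = """Sex,Age,Channel,Transportation
female,34,cash,private
male,52,credit card,public
female,70,cash,public
male,28,credit card,private
"""


def load_records(text):
    """Parse comma-delimited text into a list of dicts, one per customer."""
    return list(csv.DictReader(io.StringIO(text)))


def count_values(records, field):
    """Count how often each value of a nominal field occurs."""
    return Counter(row[field] for row in records)


records = load_records(SAMPLE_CSV)
channel_counts = count_values(records, "Channel")
print(channel_counts)
```

A real file on disk could be read the same way by passing an open file object to csv.DictReader instead of the in-memory string.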
Figure 3.2 Sample Data (Notepad format).
3.1.2 Related Work

Before data mining can be performed, the data must go through data preparation and data cleaning. Incomplete data were found in some of the records, meaning that some records lack attribute values, so data preparation is needed. Noisy data contains errors, and inconsistent data contains discrepancies in codes or names. During data preparation, only the wanted fields are selected from each table for data mining. Data preprocessing techniques such as data cleaning and data reduction were applied for the conversion. The data cleaning procedure cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers and resolving inconsistencies. Additional data cleaning can be performed to detect and remove redundancies that still occur in the results obtained after data integration. Data reduction produces a reduced representation of the data set that is much smaller in volume yet should produce the same result.
Figure 3.3 Block Diagram.
The customer data may contain attributes that take much larger values than others. If such attributes are left unnormalized, they need to be normalized. Furthermore, it would be useful for the analysis to obtain aggregate information. Data transformation operations, such as normalization and aggregation, are additional data pre-processing procedures that contribute toward the success of the mining process.
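One common way to normalize an attribute is min-max scaling, which maps the values linearly onto the range 0 to 1. The sketch below is only an illustration under that assumption; the text does not state which normalization was actually applied, and the Age values used here are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 35, 50, 80]  # hypothetical Age values
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```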
Evaluation criteria: A rich set of evaluation criteria is available in Weka.

Only the following seven criteria are used:
Correctly Classified
Incorrectly Classified
Kappa Statistic
Mean Absolute Error
Root Mean Squared Error
Relative Absolute Error
Root Relative Squared Error
We will show the results of the above evaluation criteria applied to two scenarios
based on the customer data records maintained by the shopping centre.
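To make some of these criteria concrete, the sketch below computes the number of correctly and incorrectly classified instances, the Kappa statistic, the mean absolute error and the root mean squared error for a hypothetical classifier. It uses the textbook definitions; Weka's own computation over class probability estimates may differ in detail. The two relative measures are obtained by dividing the corresponding error by the error of a simple baseline predictor and are omitted here. The data (60 customers of one class and 40 of the other, with a classifier that always predicts the majority class with probability 0.6) are assumptions for illustration.

```python
import math
from collections import Counter


def kappa_statistic(actual, predicted):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class is present
    return (observed - expected) / (1.0 - expected)


def mean_absolute_error(targets, estimates):
    """Average absolute difference between targets and estimates."""
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)


def root_mean_squared_error(targets, estimates):
    """Square root of the average squared difference."""
    squared = sum((t - e) ** 2 for t, e in zip(targets, estimates))
    return math.sqrt(squared / len(targets))


# Hypothetical data: a classifier that always predicts the majority class.
actual = ["credit card"] * 60 + ["cash"] * 40
predicted = ["credit card"] * 100
targets = [1.0] * 60 + [0.0] * 40   # 1.0 where the true class is "credit card"
estimates = [0.6] * 100             # estimated probability of "credit card"

correct = sum(a == p for a, p in zip(actual, predicted))
incorrect = len(actual) - correct
kappa = kappa_statistic(actual, predicted)
mae = mean_absolute_error(targets, estimates)
rmse = root_mean_squared_error(targets, estimates)
print(correct, incorrect, kappa, mae, rmse)
```

A Kappa of 0 for this majority-class predictor shows why the number of correctly classified instances alone can be misleading: 60 of 100 instances are correct, yet the agreement is no better than chance.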
3.1.3 Methods
Four DM algorithms were tested, as follows:
Naive Bayes Algorithm: Naive Bayes is a well-known method in machine learning. It is a simple and efficient learning method. The Naive Bayes classifier is an approximation to an ideal Bayesian classifier, which would classify an example based on the probability of each class given the example's feature variables. The main assumption is that the different features are independent of each other given the class of the example.
Decision Table: A decision table is based on logical relationships, just like a truth table. It is a tool that helps us to check combinations of conditions for both completeness and inconsistency.
Decision Tree (J48): J48 attempts to account for noise and missing data. It
also deals with numeric attributes by determining where thresholds for
decision splits should be placed. The main parameters that can be set for this
algorithm are the confidence threshold, the minimum number of instances per
leaf and the number of folds for reduced error pruning.
Association: This technique finds groups of items that tend to occur together in a transaction by searching for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis. We also identified and performed an association rule mining task. This involves:
(1) Finding rules, including appropriate parameter setting,
(2) Determining which of the resulting rules are interesting,
(3) Figuring out how the interesting rules could be useful.
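The independence assumption behind Naive Bayes, described in the list of methods above, can be illustrated with a small sketch for nominal attributes. This is not the Weka implementation; it is a minimal version that assumes add-one (Laplace) smoothing, and the four training rows and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, row[target])][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen


def predict(model, features):
    """Return the posterior probability of each class for one example."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior probability of the class
        for attr, value in features.items():
            # Features are treated as independent given the class, so the
            # per-feature likelihoods are simply multiplied together.
            count = value_counts[(attr, cls)][value]
            score *= (count + 1) / (n_cls + len(values_seen[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical training rows; Transportation is the class to predict.
rows = [
    {"Sex": "female", "Channel": "cash", "Transportation": "private"},
    {"Sex": "female", "Channel": "cash", "Transportation": "private"},
    {"Sex": "male", "Channel": "credit card", "Transportation": "private"},
    {"Sex": "male", "Channel": "credit card", "Transportation": "public"},
]
model = train_naive_bayes(rows, "Transportation")
posterior = predict(model, {"Sex": "female", "Channel": "cash"})
print(posterior)
```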
3.1.4 Result and Discussion

This section provides the simulation results produced by Weka. As noted earlier, three types of classifiers were selected under the Classification technique, namely the Naive Bayes algorithm, the Decision Table algorithm and the J48 algorithm (Decision Tree), together with the Association Rules technique.
Naive Bayes: Fig. 3.4 shows the output of the Naive Bayes algorithm used to analyse the data.
Figure 3.4 Results returned by the Naive Bayes classifier.
Fig. 3.4 shows the result of the analysis of transportation based on Naive Bayes. The result reveals that both male and female customers prefer to use private transport when travelling to the shopping centre.
Decision Table: Fig. 3.5 shows the output for the case study, which uses 100 training instances and produces 1 rule; non-matches are covered by the majority class.
Figure 3.5 The decision table of data analysis.
Decision Tree (J48): Fig. 3.6 shows the output produced by the J48 algorithm.
Figure 3.6 J48 pruned tree of sex analysis.
The software listed all the possible decision rules.

Below are some of the simulation results:
If Sex = female and transportation = private then cash
If Sex = female and transportation = public and age less than or equal to 66 then credit card
If Sex = female and transportation = public and age greater than 66 then cash
If Sex = male then credit card
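The four rules above can be read as a small decision procedure. The sketch below encodes them directly in Python; the function name predict_channel and the lowercase value spellings are assumptions made for illustration.

```python
def predict_channel(sex, transportation, age):
    """Apply the four listed J48 rules to predict the payment channel."""
    if sex == "male":
        return "credit card"
    # The remaining rules cover female customers.
    if transportation == "private":
        return "cash"
    # Female customers using public transport are split at age 66.
    if age > 66:
        return "cash"
    return "credit card"


print(predict_channel("female", "public", 40))
```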
Association: Fig. 3.7 shows the results of selecting the Apriori algorithm under Association Rules. The algorithm produces many rules, but only a few are useful for effective decision making. It cannot generate the best rules because of insufficient data.
To make sure the Apriori algorithm works well, some new fields were added to the sample data: relationship, region, brand and race. Age was removed because the Apriori algorithm does not support numeric data.
Figure 3.8 Sample Data (CSV format).
Figure 3.9 Sample Data (Notepad format).
Figure 3.10 Associate Rules.
Below is the result of the simulation:

Channel=Cash Transportation=Private Race=Chinese ==> Sex=Female
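The strength of a rule such as the one above is usually judged by its support and confidence. The sketch below shows how both could be computed for this rule; the four records are hypothetical and are not the project's data, so the resulting numbers are illustrative only.

```python
def support(records, items):
    """Fraction of records that contain every item in the given set."""
    matches = sum(1 for record in records if items.issubset(record))
    return matches / len(records)


def confidence(records, antecedent, consequent):
    """Support of the whole rule divided by support of its left-hand side."""
    return support(records, antecedent | consequent) / support(records, antecedent)


# Hypothetical records, each written as a set of attribute=value items.
records = [
    {"Sex=Female", "Channel=Cash", "Transportation=Private", "Race=Chinese"},
    {"Sex=Female", "Channel=Cash", "Transportation=Private", "Race=Chinese"},
    {"Sex=Male", "Channel=Cash", "Transportation=Private", "Race=Chinese"},
    {"Sex=Male", "Channel=Credit", "Transportation=Public", "Race=Malay"},
]
antecedent = {"Channel=Cash", "Transportation=Private", "Race=Chinese"}
consequent = {"Sex=Female"}

rule_support = support(records, antecedent | consequent)
rule_confidence = confidence(records, antecedent, consequent)
print(rule_support, rule_confidence)
```

In this hypothetical data the rule holds in 2 of 4 records (support 0.5), and 2 of the 3 records matching the left-hand side also match the right-hand side (confidence about 0.67).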
3.1.5 Comparison between Naive Bayes (NB), Decision Table (DT) and Decision Tree (J48)
Table 3.1 shows the comparison results of Naive Bayes (NB), Decision Table (DT) and J48. Overall, J48 gives better results than DT and NB since J48 produces less error.
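Naive Bayes (NB)
                               Use Training Set   Cross Validation   Percentage Split
Correctly Classified           60                 58                 19
Incorrectly Classified         40                 42                 15
Kappa Statistic                0.0909             0.0455             0.0449
Mean Absolute Error            0.4562             0.4671             0.4831
Root Mean Squared Error        0.4777             0.4897             0.5114
Relative Absolute Error        94.97%             97.23%             99.37%
Root Relative Squared Error    97.52%             99.97%             102.28%

Decision Table (DT)
                               Use Training Set   Cross Validation   Percentage Split
Correctly Classified           60                 56                 19
Incorrectly Classified         40                 44                 15
Kappa Statistic                0                  -0.0577            0
Mean Absolute Error            0.4812             0.4855             0.4868
Root Mean Squared Error        -                  0.4963             0.4994
Relative Absolute Error        100.17%            101.05%            100.14%
Root Relative Squared Error    100.01%            101.31%            99.87%

Decision Tree (J48)
                               Use Training Set   Cross Validation   Percentage Split
Correctly Classified           60                 59                 19
Incorrectly Classified         40                 41                 15
Kappa Statistic                0                  -0.0199            0
Mean Absolute Error            0.48               0.4827             0.4857
Root Mean Squared Error        -                  0.4954             0.5004
Relative Absolute Error        99.9184%           100.4763%          99.9137%
Root Relative Squared Error    99.9992%           101.1204%          100.0864%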
TABLE 3.1 COMPARISON BETWEEN NB, DT, J48 BY DATA NUMERIC.
3.1.6 Comparison between classifiers with time taken to build a model

The results in Table 3.2 show that J48 has the highest number of correctly classified instances, followed by Naive Bayes and lastly the Decision Table algorithm. The Decision Table takes the longest time to build a model, followed by Naive Bayes and the J48 algorithm.
Algorithm                         Naive Bayes   J48   Decision Table
Correctly classified instances    58            59    56
Time taken to build (second): 0.03
TABLE 3.2 COMPARISON BETWEEN CLASSIFIERS WITH TIME TAKEN TO BUILD A MODEL.
3.2 Association Rules Apriori Algorithm [24]

3.2.1 Apriori Algorithm
The Apriori algorithm mines for associations among items in a large database of sales transactions. It is an important database mining function. For example, it can reveal that a customer who purchases a keyboard also tends to buy a mouse at the same time.
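The level-wise search performed by Apriori can be sketched as follows. This is a minimal textbook-style illustration in Python, not the Java source code used in this project and not Weka's implementation; the baskets and the minimum support count of 2 are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = set()
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c.issubset(t))
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical baskets for illustration.
baskets = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "monitor"},
    {"mouse"},
]
frequent_itemsets = apriori(baskets, min_support=2)
print(frequent_itemsets)
```

With these baskets, keyboard and mouse are found to be a frequent pair, while the three-item set is pruned because one of its two-item subsets is not frequent.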
3.2.2 Limitations of Apriori Algorithm

The Apriori algorithm is simple and easy to execute, but it has some limitations. The main limitation is the cost of handling a huge number of candidate sets when there are many frequent itemsets, a low minimum support or large itemsets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets and to accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, for example {v1, v2, v3, ..., v100}, it must generate on the order of 2^100 candidate itemsets in total, which makes candidate generation costly and time-consuming. In addition, the algorithm repeatedly scans the database and checks a large set of candidates by pattern matching. The Apriori algorithm therefore has very low efficiency when memory capacity is limited and the number of transactions is large.
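The two figures quoted above can be checked with a short calculation. The sketch assumes that candidate 2-itemsets are all unordered pairs of the frequent 1-itemsets, and that the candidates for a pattern of size 100 are all of its non-empty subsets.

```python
import math

# All unordered pairs that can be formed from 10^4 frequent 1-itemsets.
candidate_pairs = math.comb(10**4, 2)

# All non-empty subsets of a frequent pattern with 100 items.
candidate_subsets = 2**100 - 1

print(candidate_pairs)    # 49995000, which is more than 10^7
print(candidate_subsets)  # about 1.27 * 10^30
```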
Chapter 4
4.0 DESIGN

The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyze the data.
4. Create a model set
5. Fix problems with the data.
6. Transform data
7. Build models.
8. Deploy models
Figure 4.1 Data Mining is not a linear process.
As shown in Figure 4.1, the data mining process is best considered as a set of nested loops rather than a straight line. The steps do have an order, but it is not necessary to finish one step completely before moving on to the next. After completing a later step, the process may revisit an earlier one.
4.1.1 Step One: Translate the business problem into a data mining problem
The first step is to explore the available data and make a list of candidate business problems. A well-defined business problem leads the data mining project to the proper destination and to a solution. The data mining goals for a particular project should be specific rather than broad and general, which makes it easier to monitor progress in achieving them. Examples of specific goals:
Identify customers who are unlikely to renew their subscriptions.
Forecast customer population in future months.
List products whose sales are at risk if we discontinue wine and beer sales.
4.1.2 Step Two: Select appropriate data

Data mining requires data. Ideally, the data would already reside in a corporate data warehouse, where it is cleansed, available, historically accurate and frequently updated. The data sources that are useful and valuable vary from problem to problem and from industry to industry. A few examples of useful data:
Point of sale data (coupons, discount)
Credit card charge records
Direct mail response records
4.1.3 Step Three: Analyze the data

A good first step is to examine the dataset and understand any data file that comes from a new source. Data visualization is one of the best ways to get to know the data.
Figure 4.2 Sample data in ARFF Viewer.
Figure 4.3 Data Visualize.
Figure 4.4 Visualization of data by age and sex.
4.1.4 Step Four: Create a Model Set for Prediction

Creating a model set for prediction requires assembling data from different sources.
When making a prediction, the predictive model uses data from the past, finding
patterns to make predictions about the future. Time can always be divided into three
periods: the past, present, and future.
Figure 4.5 Data from the past mimics data from the past, present, and future.
4.1.5 Step Five: Fix Problem with the Data
Figure 4.6 Sample data.
Variables such as address, post, telephone number and email are useful information, but not all data mining algorithms can handle them. So we have to fix the data by replacing these variables with other attributes.
4.1.6 Step Six: Transform Data to Bring Information to the Surface

Once all the steps above have been completed, it is time to bring the information to the surface by adding derived fields, combining multiple variables, and creating ratios and logarithms. Because different people spend different amounts of money on a product, some buying more and some buying less, it is wiser to convert the money values to proportions of each person's spending.
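As a small illustration of this transformation, the sketch below converts one customer's spending per category into proportions of that customer's total spending. The category names and amounts are hypothetical.

```python
def spending_proportions(spending):
    """Convert absolute amounts per category into shares of the total."""
    total = sum(spending.values())
    if total == 0:
        # A customer with no spending gets a share of 0.0 everywhere.
        return {category: 0.0 for category in spending}
    return {category: amount / total for category, amount in spending.items()}


# Hypothetical spending of one customer, in the same currency unit.
customer = {"drinks": 50.0, "food": 30.0, "clothing": 20.0}
shares = spending_proportions(customer)
print(shares)
```

After this conversion, a customer who spends little overall and one who spends a lot can be compared on how they divide their money, rather than on the absolute amounts.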
4.1.7 Step Seven: Build Models

A sample model based on the sample data used in Chapter 3.
Figure 4.7 Data Mining Model.
The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.
4.1.8 Step Eight: Deploy Models

Deploying a model means moving it from the data mining environment to the scoring environment. Once a model has been created, it can be used to make predictions for new data. The model is built using historical customer data. This process is illustrated below:
Figure 4.8 Data Mining Process Model.
The process of making predictions for data is called scoring. The process of using the model is different from the process that creates it. After a model is created, it is used multiple times to score different databases. For example, it can be used to predict the probability that a customer will purchase an item during a sale.
Figure 4.9 Data Mining Scoring Process Model.
In the end, scoring generates a prediction between 0 and 1 as the output.
Figure 4.10 Scoring Prediction.
Chapter 5
5.0 IMPLEMENTATION

The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyse the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data.
7. Build models.
8. Deploy models.
5.1.1 Translate the business problem into a data mining problem
Example Scenario
A shopping centre wants to know about its sales for the past 5 months, so that it can forecast and achieve its target sales for future months. Below are the specific goals:
List products whose sales are at risk if we discontinue beer sales.
Identify which products should be promoted in future months.
5.1.2. Select appropriate data

Data Cleaning Process (Before)
Figure 5.1 Appropriate data.
Above is a CSV file that contains 1000 user/customer profiles for testing purposes. The data contain errors and inconsistencies, and some records lack attribute values. A data cleaning procedure is needed before testing to fill in the missing values, smooth noisy data, identify or remove outliers and resolve inconsistencies.
The included fields are:
userID
smoker
drink_level
dress_preference
ambience
transport
marital_status
hijos
birth_year
interest
personality
religion
activity
color
weight
budget
height
Upayment
Fcuisine
5.1.3 Analyse the data

This step examines the dataset, a data file from a new source, using the Weka ARFF Viewer and the Visualize tab of the Weka Explorer.
Figure 5.2 Analyse data in ARFF Viewer.
Figure 5.3 Data Visualize.
Figure 5.4 Visualization of data by smoker and drink_level.
5.1.4 Create a model set
Chart: Amount of drinks sold for the past 5 months (vertical axis: amount of drinks, 0 to 140; horizontal axis: month).
Figure 5.5 Prediction Model
A model set was created for predicting the amount of drinks sold, based on the data set for the past 5 months. When making a prediction, the predictive model uses data from the past, finding patterns to make predictions about the future. From the model set, we found that alcoholic drinks had the highest sales during the 5-month period. Thus, we should not discontinue beer sales. We can run promotions for non_alcohol and juice during the 3rd and 4th months to boost their sales.
5.1.5 Fix problems with the data

Data Cleaning Process (After)
Figure 5.6 Fixed dataset.
The figure above shows the fixed dataset after the data cleaning process. Variables such as address, post, telephone number and email are useful information, but not all of them can be handled by the data mining algorithms in this project. So we have to choose the attributes that can be used with Association Rules and fix the data by replacing the others with other attributes.
5.1.6 Transform data
Figure 5.7 Transformed data.
Comparing Figure 5.7 with the previous Figure 5.6, there are some changes to the income attribute. Association Rules cannot read numeric data, so we have to convert the numeric data into nominal data. The income attribute values are converted to low, medium or high instead of numbers.
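The conversion from numeric to nominal values can be sketched as a simple binning function. The cut-off values 2000 and 5000 below are hypothetical; the thresholds actually used to produce Figure 5.7 are not stated in the text.

```python
def to_income_level(income, low_max=2000, medium_max=5000):
    """Map a numeric income to a nominal label (cut-offs are hypothetical)."""
    if income > medium_max:
        return "high"
    if income > low_max:
        return "medium"
    return "low"


incomes = [1500, 3200, 8000]  # hypothetical numeric values
levels = [to_income_level(value) for value in incomes]
print(levels)  # ['low', 'medium', 'high']
```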
5.1.7 Build models
Figure 5.8 Data Mining Model.
The diagram illustrates the flow of data when a mining structure is processed and when a mining model is processed. The data is filtered to create 3 models. To build a model, we can use parameters to adjust the algorithm and apply filters to the dataset, which creates different results. The mining model object contains summaries and patterns that can be used for prediction. The 3 models are shown in the figures below:
Model 1 (filter: religion = non_muslim). Bar chart of drinks by category (alcohol, non_alcohol, juice); bar labels shown: 374, 155, 107.
Figure 5.9 Model 1
Model 2 (filter: food_preference = non_halal). Bar chart of shopping_cart by category (beef, chicken, pork); bar labels shown: 104, 96, 84.
Figure 5.10 Model 2
Model 3 (1000 customers). Bar chart of payment by category (cash, credit card, debit card); bar labels shown: 618, 349, 33.
Figure 5.11 Model 3
5.1.8 Deploy model

The last step in the data mining process is to deploy the best-performing models to a production environment.
Use the models to create predictions, which can then be used to make business decisions.
Update the models dynamically as more data comes into the organization.
5.2 Apriori Algorithm Source Code
5.3 Import dataset into WEKA

The same dataset was imported into Weka to compare the results of the original Apriori algorithm and the modified Apriori algorithm. The figures below show the results:
Original
Figure 5.16 Result of original code.
The number of association rules generated is 10. The total time is 47 ms.
Modified
Figure 5.17 Result of modified code.
The number of association rules generated is 10. The total time is 44 ms, so the runtime of the Apriori algorithm has been improved.
Chapter 6
6.0 CONCLUSION
6.1 Progress and Outcome

The first phase of this project has been completed successfully. First, the problem was identified and a list of objectives was set. Then the research stage started. It involved conducting a literature review as well as a review of data mining techniques. Furthermore, research on Customer Relationship Management (CRM) in data mining and on data mining application areas was performed.
The analysis section began by analysing the current situation of shopping centres and the way people spend their money in the past, present and future. The most important part, however, was testing several data mining algorithms and comparing them to find the best method. Finally, the design section covered the design of a data mining process and a model set.
The section after design is implementation. The implementation began with the data mining process (methodology) and proceeded with building a model set for prediction. The modified code was then compiled using Apache Ant so that it could be used by WEKA. Lastly, the best rules were generated by importing the dataset into the WEKA software, and the run times of the original code and the modified code were compared in different WEKA builds. The results show that the run time improved with the modified code.
6.2 Problems Encountered
Difficulty with using data mining tool, WEKA.
Difficulty with obtaining datasets.
Lack of experience in designing a model.
Lack of Internet resources.
Difficulty in modifying source code.
Time limitation as several courses requirements were due at the same time.
6.3 Future Planning

There is a new data mining software package named SPMF. SPMF is an open-source data mining library written in Java, specialized in pattern mining. It offers implementations of 86 data mining algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining and clustering. I hope to do some research on this software in the future and compare it with my current project.
REFERENCES
Online Research
1) Data Mining: What is Data Mining
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/
datamining.htm Date extracted: 24/6/2014
2) Definition of Data Mining

http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/ Date
extracted: 24/6/2014
3) Investopedia explains Data Mining
http://www.investopedia.com/terms/d/datamining.asp Date extracted:
24/6/2014
4) Oracle Data Mining Concepts
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/process.htm#CHD
FGCIJ Date extracted: 30/6/2014
5) Resources of Data Mining
http://www.rithme.eu/?m=resources&p=resources&lang=en Date extracted: 30/6/2014
6) Data Mining Techniques
http://www.ibm.com/developerworks/opensource/library/ba-data-miningtechniques/index.html Date extracted: 6/7/2014
7) Carry Out Data Mining and Machine Learning with Weka
http://www.opensourceforu.com/2014/03/carry-data-mining-machine-learningweka/ Date extracted: 10/7/2014
8) An Introduction to Data Mining

http://www.thearling.com/text/dmwhite/dmwhite.htm Date extracted:
27/6/2014
9) How Business Can Benefit from Data Mining
http://www.tmcnet.com/topics/articles/2013/03/21/331429-how-businessesbenefit-from-data-mining.htm Date extracted: 27/6/2014
10) An Overview of Data Mining Techniques

http://www.thearling.com/text/dmtechniques/dmtechniques.htm Date
extracted:
11) Data Mining Techniques
http://www.uta.edu/faculty/sawasthi/Statistics/stdatmin.html#index Date
extracted: 7/7/2014
12) Data Mining Classification
http://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
Date extracted: 17/7/2014
13) Data Mining System
http://www.tutorialspoint.com/data_mining/dm_systems.htm Date extracted:
14) Data Mining Process Model
http://www.rithme.eu/?m=resources&p=dmmethod&lang=en Date extracted:
15) CRM Customer Relationship Management
http://www.webopedia.com/TERM/C/CRM.html Date extracted: 1/8/2014
16) What is CRM? http://searchcrm.techtarget.com/definition/CRM, posted by
Margaret Rouse. Date extracted: 11/8/2014
17) Data Mining and Customer Relationships
http://www.thearling.com/text/whexcerpt/whexcerpt.htm, by Kurt Thearling.
18) A Review of Data Mining Tools in Customer Relationship Management

http://www.tlainc.com/articl149.htm, Journal of Knowledge Management
Practice, Vol. 9, No. 1, March 2008 - Jayanthi Ranjan, Institute of
Management Technology, Ghaziabad, Vishal Bhatnagar, Indraprastha

University, Delhi. Date extracted: 19/8/2014
19) Data Mining for Shopping Centres Customer Knowledge Management
Framework
http://bura.brunel.ac.uk/bitstream/2438/1471/1/KMSCBasedOnChapshortV5.pdf Date extracted: 30/8/2014
20) Customer Classification And Prediction Based On Data Mining Technique
http://www.ijetae.com/files/Volume2Issue12/IJETAE_1212_58.pdf Date extracted: 14/8/2014
21) Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management
http://books.google.com.my/books?id=AyQfVTDJypUC&pg=PA162&lpg=PA162&dq=Membership+Supermarket%27s+Customer+in+data+mining&source=bl&ots=KWFyqsQYyK&sig=UyhkDWZ2kHDBxXVtW9nx5SnTIo&hl=en&sa=X&ei=cZ_8U5_2KoWE8gW9_4CADA&redir_esc=y#v=onepage&q=Membership%20Supermarket's%20Customer%20in%20data%20mining&f=false Date extracted: 13/8/2014
22) How Do Supermarkets Use Your Data?
http://www.select-statistics.co.uk/article/blog-post/how-do-supermarkets-useyour-data Date extracted: 29/8/2014
23) What is the CRISP-DM methodology?
http://www.sv-europe.com/crisp-dm-methodology/ Date extracted: 21/8/2014
24) Association Rules Apriori Algorithm
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571250083/licao_9.pdf Date extracted: 29/9/14
25) Data Mining Applications & Trends
http://www.tutorialspoint.com/data_mining/dm_applications_trends.htm Date
26) GitHub
https://github.com/jashmenn/apriori

27) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
28) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
29) AprioriItemset Generation
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
30) Pentaho Data Mining
http://wiki.pentaho.com/display/DATAMINING/Apriori
31) SPMF
http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php
32) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
33) All My Brain
http://allmybrain.com/2007/11/12/implementing-the-apriori-data-miningalgorithm-with-javascript/ Date extracted: 12/1/2015
34) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
35) stackoverflow
http://stackoverflow.com/questions/17125742/creating-k-itemsets-from-2itemsets Date extracted: 16/1/2015
36) compilr
https://compilr.com/soniaj/apriori/Project.java
37) Apache Ant - Tutorial
http://www.vogella.com/tutorials/ApacheAnt/article.html
38) Uregina
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/Apriori.java Date
Reference Book
1) Data Mining: Practical Machine Learning Tools and Techniques, Second Edition,
by Ian H. Witten, Department of Computer Science, University of Waikato and
Eibe Frank, Department of Computer Science, University of Waikato.
APPENDIX
Project 1 Gantt Chart
Semester 2
Dates: 21-Nov, 28-Nov, 5-Dec, 12-Dec, 19-Dec, 26-Dec, 2-Jan, 9-Jan, 16-Jan, 23-Jan, 30-Jan
No. Activities
1 Deeply research in apriori algorithm
2 Select appropriate data
3 Analyse and prepare dataset for simulation
4 Modify apriori algorithm
5 Validate model
6 Documentation
Project 2 Gantt Chart
