
Presented to

Prof. Sweta Agarawa


Presented by:
Madhav Kumar jha(09098)
Kshitij Pandey(09094)
Kunal Saurab(09096)
Kapil Chaudhry(09091)
Gangadhar G.(09073)
Ekta Ahuja(09070)
Manpreet Kaur(09102)
Neeti Shree(09116)
Annindya
• Association rule mining finds interesting association or
correlation relationships among a large set of data items.
• This can help in many business decision-making processes:
store layout, catalog design, and customer segmentation based
on buying patterns. Another important field is medical
applications.
• Market basket analysis is a typical example of association rule
mining.
• How can we find association rules from large amounts of data?
Which association rules are the most interesting? How can we
help or guide the mining procedure?

• Given a set of database transactions, where each transaction is a set
of items, an association rule is an expression
X → Y
where X and Y are sets of items (literals). The intuitive meaning
of the rule: transactions in the database that contain the items in X
tend to also contain the items in Y.
• Example: 98% of customers who purchase tires and auto accessories
also buy some automotive services; here 98% is called the
confidence of the rule. The support of the rule X → Y
is the percentage of transactions that contain both X and Y.
• The problem of mining association rules is to find all rules that
satisfy a user-specified minimum support and minimum confidence.
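
The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the function names and the five baskets are hypothetical and are not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    covered = [t for t in transactions if x <= t]
    if not covered:
        return 0.0  # X never occurs, so the rule has no evidence
    return sum(1 for t in covered if y <= t) / len(covered)


# Hypothetical toy data (an assumption made for this sketch).
baskets = [frozenset(b) for b in (
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"service"},
    {"tires"},
)]

X = {"tires", "accessories"}
Y = {"service"}
rule_support = support(baskets, X | Y)       # transactions with both X and Y
rule_confidence = confidence(baskets, X, Y)  # share of X-transactions with Y
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule would then be reported only if both values reach the user-specified minimum support and minimum confidence.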

• Consider a shopping cart filled with several items.
• Market basket analysis tries to answer the following
questions:
  Who makes purchases?
  What do customers buy together?
  In what order do customers purchase items?
• It prompts other decisions:
  Where to place items in the store (e.g., together or apart)?
  Which items should we put on sale, and which not?
Let’s look at a concrete example of Apriori, based on the
AllElectronics transaction database D, shown below. There are
nine transactions in this database, i.e., |D| = 9. We use the next
figure to illustrate the finding of frequent itemsets in D.

TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Scan D for the count of each candidate to obtain C1:

Itemset    Sup. count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2

Compare each candidate support count with the minimum support count to obtain L1:

Itemset    Sup. count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2

Generate C2 candidates from L1 to obtain C2:

{I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

Scan D for the count of each candidate (C2 with support counts):

Itemset    Sup. count
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

Compare each candidate support count with the minimum support count to obtain L2:

Itemset    Sup. count
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
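Generate C3 candidates from L2 to obtain C3:

{I1,I2,I3}, {I1,I2,I5}

Scan D for the count of each candidate (C3 with support counts):

Itemset       Sup. count
{I1,I2,I3}    2
{I1,I2,I5}    2

Compare each candidate support count with the minimum support count to obtain L3:

Itemset       Sup. count
{I1,I2,I3}    2
{I1,I2,I5}    2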
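
The level-wise procedure traced above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. The minimum support count of 2 is an assumption inferred from which itemsets the tables keep, and the candidate generation uses a simple pairwise union followed by the subset-based prune instead of the usual prefix join.

```python
from itertools import combinations

# AllElectronics transactions, copied from the table above.
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
MIN_SUP_COUNT = 2  # assumption: inferred from the tables, not stated on the slides


def count_support(candidates, transactions):
    """Scan the transactions once and count each candidate itemset."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def generate_candidates(prev_frequent, k):
    """Build size-k candidates whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return [L1, L2, ...], each a dict mapping itemset -> support count."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    levels = []
    k = 1
    while candidates:
        counts = count_support(candidates, transactions)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return levels


levels = apriori(D.values(), MIN_SUP_COUNT)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:")
    for itemset, n in sorted(level.items(), key=lambda kv: sorted(kv[0])):
        print("  {" + ",".join(sorted(itemset)) + "}", n)
```

With these inputs the loop stops after L3, because the only size-4 candidate, {I1,I2,I3,I5}, contains the infrequent subset {I1,I3,I5} and is pruned.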

Established software for fast, effective discovery of real
associations.
• Magnum Opus is the only association discovery software that finds the
core associations in data and discards the rest.
• Some other special features are:
1) Easy to use: it does not presume advanced knowledge of statistics or
machine learning.
2) Proven software, first released in 1999.
3) Designed by data miners for data miners. It addresses the real challenges
of transforming data to knowledge.
4) Fast. It has linear compute time: so long as your data fit in your computer's
physical memory, if the amount of data is doubled then the compute time
will approximately double.
• Magnum Opus version 4.6 is the most powerful version.
• It is widely used in scientific research.
• It can be used for contrast discovery (also known as emerging pattern
discovery and closely related to subgroup discovery).
• A simple example:
• We start with a simple invented example of analyzing the purchasing habits of a
customer of a fictitious grocery store. The customer has visited the store on ten
occasions, each time buying a different selection of goods.
• The following item-list file records the customer’s purchasing behavior.
Each line represents the items bought on a single visit.

plums, lettuce, tomatoes
celery, confectionery
apples, carrots, tomatoes, potatoes
potatoes
confectionery
carrots
apples, oranges, lettuce, tomatoes
peaches, oranges, celery, potatoes, confectionery
oranges, lettuce, carrots, tomatoes
apples, bananas, plums, carrots, tomatoes, onions
• These can be processed by Magnum Opus to find rules such
as the following four.
apples -> tomatoes [Coverage=0.300 (3); Support=0.300 (3);
Strength=1.000; Lift=2.00; Leverage=0.1500 (1.5)]
lettuce -> tomatoes [Coverage=0.300 (3); Support=0.300 (3);
Strength=1.000; Lift=2.00; Leverage=0.1500 (1.5)]
tomatoes -> apples [Coverage=0.500 (5); Support=0.300 (3);
Strength=0.600; Lift=2.00; Leverage=0.1500 (1.5)]
tomatoes & oranges -> lettuce [Coverage=0.200 (2);
Support=0.200 (2); Strength=1.000; Lift=3.33;
Leverage=0.1400 (1.4)]
1. Each rule presents a list of items to the left of the arrow that are associated with
the single item to the right of the arrow.
2. A number of relevant statistics are then presented that describe the nature of the
association. Thus, the first two of these rules indicate that whenever either
apples or lettuce is purchased, tomatoes are also purchased.
3. The third rule indicates that apples are more likely to be purchased if tomatoes
are purchased. The item list supports the same conclusion for lettuce, which
appears in three of the five visits that include tomatoes.
4. The final rule shows that whenever both tomatoes and oranges are purchased,
lettuce is also purchased.
This is a very simplistic example.  In practice it would be foolish to draw strong
conclusions from such limited data. 
Indeed, Magnum Opus includes facilities for assessing the strength of evidence in
support of a rule, and these mechanisms would reject all the above rules as
having insufficient support. 
This example is intended to illustrate the type of analysis that Magnum Opus
performs, albeit normally on much larger volumes of more complex data.
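
The statistics quoted for the four rules can be recomputed directly from the ten visits. The sketch below is not Magnum Opus code. The formulas are assumptions, chosen because they reproduce the figures shown above: coverage is the share of visits containing the left-hand side, support is the share containing both sides, strength is support divided by coverage, lift is strength divided by the share of visits containing the right-hand item, and leverage is support minus coverage times that share.

```python
# The ten visits from the item-list file above, one set per visit.
visits = [
    {"plums", "lettuce", "tomatoes"},
    {"celery", "confectionery"},
    {"apples", "carrots", "tomatoes", "potatoes"},
    {"potatoes"},
    {"confectionery"},
    {"carrots"},
    {"apples", "oranges", "lettuce", "tomatoes"},
    {"peaches", "oranges", "celery", "potatoes", "confectionery"},
    {"oranges", "lettuce", "carrots", "tomatoes"},
    {"apples", "bananas", "plums", "carrots", "tomatoes", "onions"},
]


def rule_stats(lhs, rhs, transactions):
    """Statistics for the rule lhs -> rhs (formulas assumed, see lead-in)."""
    lhs = set(lhs)
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both_count = sum(1 for t in transactions if lhs <= t and rhs in t)
    rhs_count = sum(1 for t in transactions if rhs in t)
    if lhs_count == 0 or rhs_count == 0:
        raise ValueError("rule sides must each occur at least once")
    coverage = lhs_count / n
    support = both_count / n
    rhs_freq = rhs_count / n
    strength = support / coverage
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "lift": strength / rhs_freq,
        "leverage": support - coverage * rhs_freq,
    }


rules = [
    ({"apples"}, "tomatoes"),
    ({"lettuce"}, "tomatoes"),
    ({"tomatoes"}, "apples"),
    ({"tomatoes", "oranges"}, "lettuce"),
]
for lhs, rhs in rules:
    s = rule_stats(lhs, rhs, visits)
    print(
        " & ".join(sorted(lhs)), "->", rhs,
        f"[Coverage={s['coverage']:.3f}; Support={s['support']:.3f}; "
        f"Strength={s['strength']:.3f}; Lift={s['lift']:.2f}; "
        f"Leverage={s['leverage']:.4f}]",
    )
```

Because every statistic here rests on at most five visits, the numbers illustrate the definitions only; they say nothing about whether a rule would hold on more data.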
SAS Enterprise Miner streamlines the data mining process
to create highly accurate predictive and descriptive models
based on analysis of vast amounts of data from across the
enterprise.
It offers a rich, easy-to-use set of integrated capabilities
for creating and sharing insights that can be used to drive
better decisions.
SAS data mining software helps you:
• detect fraud,
• minimize risk,
• anticipate resource demands,
• increase response rates for marketing campaigns, and
• curb customer attrition.
• Support the entire data mining process with a broad
set of tools.
• Build better models with a versatile data mining
workbench.
• Enable business analysts and subject-matter experts
with limited statistical skills to generate predictive
models for a variety of business scenarios.
• Enhance accuracy of predictions and easily share
reliable information to improve the quality of
decisions.
• Ease the model deployment and scoring process.
• Powerful, easy-to-use GUI, as well as batch
processing for large jobs
• Data preparation, summarization and exploration
• Advanced predictive and descriptive modeling
• Fast, easy and self-sufficient way for business users
to generate models
• Business-based model comparisons, reporting and
management
• Automated scoring process
• Open, extensible design
• Scalable processing
• Data access, management and cleansing are seamlessly integrated, making
it easier to prepare data for analysis.
• Robust variable selection and data modification tools improve the quality
of data, which leads to better modeling and more reliable results.
• With multithreaded algorithms and support for multiprocessing and grid
computing, execution time is reduced and hardware resources are used
more efficiently.
• Business analysts and subject matter experts can quickly generate
predictive models using the SAS Rapid Predictive Modeler.
• IBM's data mining capabilities help you detect fraud, segment your
customers, and simplify market basket analysis.
• IBM's in-database mining capabilities integrate with your existing
systems to provide scalable, high performing predictive analysis without
moving your data into proprietary data mining platforms.
• Use SQL, Web Services, or Java to access DB2's data mining capabilities
from your own applications or business intelligence tools from IBM's
business partners.
• DB2 Intelligent Miner for Data performs mining functions against
traditional DB2 databases or flat files. It also has capabilities to access
data in other relational DBMSs using ODBC.

• However, to do this, IBM’s DataJoiner must be used. Intelligent Miner for
Data is implemented using a client-server approach, with a straightforward
GUI provided to assist the user in choosing data mining functions. Several
visualization techniques are used.
• There are two other products in the IBM Intelligent Miner family.
Intelligent Miner for Text performs mining activities against textual data,
including e-mail and web pages. It consists of text analysis tools that include
the ability to cluster, classify, summarize, and extract important features
from a document.
• NetQuestion Solution is a set of tools to facilitate indexing and searching
web documents. DB2 Intelligent Miner Scoring allows SQL applications
to request data mining applications against a DB2 or Oracle
database. It is a user-defined extension to DB2. It can be used to
determine the actual score that a record has with respect to user-defined
ranking criteria.
• Uses association technique
• to sell its services via direct mail
• Runs a neural net model
Applied association rules in library
(VIDEO)
• Association rules are generated of the general form if Body then Head,
where Body and Head stand for single codes or text values (items) or
conjunctions of codes or text values (items); e.g., if (Car=Porsche and
Age<20) then (Risk=High and Insurance=High).
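
One way to picture the "if Body then Head" form is to treat Body and Head as conditions on a record and count how often each holds. The sketch below uses four hypothetical records invented for illustration; only the attribute names come from the example rule.

```python
# Hypothetical records (an assumption made for this sketch).
records = [
    {"Car": "Porsche", "Age": 19, "Risk": "High", "Insurance": "High"},
    {"Car": "Porsche", "Age": 18, "Risk": "High", "Insurance": "High"},
    {"Car": "Porsche", "Age": 19, "Risk": "High", "Insurance": "Low"},
    {"Car": "Volvo", "Age": 45, "Risk": "Low", "Insurance": "Low"},
]


def body(r):
    """Body: Car=Porsche and Age<20."""
    return r["Car"] == "Porsche" and r["Age"] < 20


def head(r):
    """Head: Risk=High and Insurance=High."""
    return r["Risk"] == "High" and r["Insurance"] == "High"


def evaluate_rule(rows, body_fn, head_fn):
    """Return (support, confidence) of the rule 'if Body then Head'."""
    body_rows = [r for r in rows if body_fn(r)]
    both = sum(1 for r in body_rows if head_fn(r))
    rule_support = both / len(rows)
    rule_confidence = both / len(body_rows) if body_rows else 0.0
    return rule_support, rule_confidence


rule_support, rule_confidence = evaluate_rule(records, body, head)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```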
• For example, consider the data that describe a (fictitious) survey of 100
patrons of sports bars and their preferences for watching various sports
on television. This would be an example of simple categorical variables,
where each variable represents one sport. For each sport, each
respondent indicated how frequently s/he watched the respective type of
sport on television. The association rules derived from these data could
be summarized as follows:

• In this graphical summary, the strongest support value was found for
Swimming=Sometimes, which was associated with Gymnastic=Sometimes,
Baseball=Sometimes, and Basketball=Sometimes.
• As in the 2D Association Network, the support values for the Body and Head
portions of each association rule are indicated by the sizes and colors of
the circles in the 2D plane. The thickness of each line indicates the confidence
value (the conditional probability of Head given Body) for the respective
association rule; the sizes and colors of the "floating" circles plotted against
the (vertical) z-axis indicate the joint support (for the co-occurrences) of the
respective Body and Head components of the association rules. The plot position
of each circle along the vertical z-axis indicates the respective confidence
value. Hence, this particular graphical summary clearly shows two simple rules:
respondents who name Pizza as a preferred fast food also mention Hamburger, and
vice versa.
• This is an example of how association rules can be applied to text mining
tasks. This analysis was performed on the paragraphs (dialog spoken by
the characters in the play) in the first scene of Shakespeare's "All's Well
That Ends Well," after removing a few very frequent words such as "is" and "of".
Of course, the specific words and phrases removed during the data
preparation phase of text (or data) mining projects will depend on the
purpose of the research.
Thank you
