DWM Unit V
This process analyzes customer buying habits by finding associations between the
different items that customers place in their shopping baskets. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers do selective marketing and plan their shelf space.
Example:
If customers who purchase computers also tend to buy antivirus software at the same
time, then placing the hardware display close to the software display may help increase
the sales of both items. In an alternative strategy, placing hardware and software at
opposite ends of the store may entice customers who purchase such items to pick up
other items along the way. For instance, after deciding on an expensive computer, a
customer may observe security systems for sale while heading toward the software
display to purchase antivirus software and may decide to purchase a home security
system as well. Market basket analysis can also help retailers plan which items to put
on sale at reduced prices. If customers tend to purchase computers and printers
together, then having a sale on printers may encourage the sale of printers as well as
computers.
Market Basket Analysis in Data Mining
Market basket analysis is a data mining technique used by retailers to increase sales
by better understanding customer purchasing patterns. It involves analyzing large
data sets, such as purchase history, to reveal product groupings and products that
are likely to be purchased together.
The adoption of market basket analysis was aided by the advent of electronic point-
of-sale (POS) systems. Compared to handwritten records kept by store owners, the
digital records generated by POS systems made it easier for applications to process
and analyze large volumes of purchase data.
One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes
transaction data contained in a spreadsheet and performs market basket analysis. A
transaction ID must relate to the items to be analyzed. The Shopping Basket Analysis
tool then creates two worksheets:
o The Shopping Basket Item Groups worksheet, which lists items that are frequently
purchased together, and
o The Shopping Basket Rules worksheet, which shows how items are related (for example,
purchasers of Product A are likely to buy Product B).
An association rule such as {bread} ⇒ {butter} has two parts:
o Antecedent: Items or 'itemsets' found within the data are antecedents. In simpler
words, it is the IF component, written on the left-hand side. In the rule
{bread} ⇒ {butter}, bread is the antecedent.
o Consequent: A consequent is an item or set of items found in combination with the
antecedent. It is the THEN component, written on the right-hand side. In the rule
{bread} ⇒ {butter}, butter is the consequent.
There are three main types of market basket analysis:
1. Descriptive market basket analysis: This type only derives insights from past data
and is the most frequently used approach. The analysis does not make any
predictions but rates the association between products using statistical techniques.
For those familiar with the basics of data analysis, this type of modelling is known as
unsupervised learning.
2. Predictive market basket analysis: This type uses supervised learning models like
classification and regression. It aims to mimic the market to analyze what causes
what to happen, essentially considering items purchased in a sequence to determine
cross-selling opportunities. For example, buying an extended warranty is more likely to
follow the purchase of an iPhone. While it is not as widely used as descriptive MBA, it
is still a very valuable tool for marketers.
3. Differential market basket analysis: This type of analysis is beneficial for competitor
analysis. It compares purchase history between stores, between seasons, between
two time periods, between different days of the week, etc., to find interesting
patterns in consumer behaviour. For example, it can help determine why some users
prefer to purchase the same product at the same price on Amazon vs Flipkart. The
answer can be that the Amazon reseller has more warehouses and can deliver faster,
or maybe something more profound like user experience.
Algorithms that use association rules include AIS, SETM and Apriori. The Apriori
algorithm is the one most commonly cited by data scientists in research articles about
market basket analysis. It identifies frequent itemsets in the database and then
extends them to larger itemsets, as long as those itemsets appear sufficiently often.
R's arules package is an open-source toolkit for association mining using the R
programming language. This package supports the Apriori algorithm; related
packages such as arulesNBMiner, opusminer, RKEEL and RSarules provide further
mining algorithms.
With the help of the Apriori algorithm, we can further classify and simplify the item
sets that the consumer frequently buys. There are three measures in the Apriori
algorithm:
o Support
o Confidence
o Lift
For example, suppose 5000 transactions have been made through a popular e-
commerce website, and we want to calculate the support, confidence, and lift for two
products, say a pen and a notebook. Out of the 5000 transactions, 1000 contain a
pen, 700 contain a notebook, and 500 contain both. (Note that the count for the
combination can never exceed the count for either individual item.)
Support
Support is the number of transactions containing an itemset divided by the total
number of transactions: support(pen) = 1000/5000 = 20%, support(notebook) =
700/5000 = 14%, and support(pen and notebook) = 500/5000 = 10%.
Confidence
Confidence measures whether a product sells mainly on its own or through combined
sales. It is calculated as combined transactions divided by individual transactions:
confidence(pen ⇒ notebook) = 500/1000 = 50%.
Lift
Lift compares the observed support of the combination with the support expected if
the two items were bought independently:
lift(pen ⇒ notebook) = 0.10 / (0.20 × 0.14) ≈ 3.57.
When the lift value is below 1, the combination is not frequently bought together by
consumers. Here, lift is well above 1, which shows that the probability of buying both
items together is high compared with what the individual sales figures alone would
suggest.
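The three measures can be computed directly from transaction counts. A minimal sketch, assuming consistent illustrative counts of 1000 transactions containing a pen, 700 containing a notebook, and 500 containing both (the joint count can never exceed either individual count):

```python
# Illustrative counts (hypothetical e-commerce data).
total = 5000          # total transactions
pen = 1000            # transactions containing a pen
notebook = 700        # transactions containing a notebook
both = 500            # transactions containing both items

support_pen = pen / total                    # 0.20
support_notebook = notebook / total          # 0.14
support_both = both / total                  # 0.10

# Confidence of the rule pen => notebook: of the transactions
# containing a pen, what fraction also contain a notebook?
confidence = both / pen                      # 0.50

# Lift: observed support of {pen, notebook} versus the support
# expected if the two items were purchased independently.
lift = support_both / (support_pen * support_notebook)

print(f"support={support_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1, as here, indicates the pair is bought together more often than independence would predict.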
Applications of market basket analysis include:
o Increasing market share: Once a company hits peak growth, it becomes challenging
to determine new ways of increasing market share. Market basket analysis can be
used to combine demographic and gentrification data to determine the location
of new stores or geo-targeted ads.
o Behaviour analysis: Understanding customer behaviour patterns is a cornerstone of
marketing. MBA can be used anywhere from simple catalogue design to UI/UX.
o Optimization of in-store operations: MBA is not only helpful in determining what
goes on the shelves but also behind the store. Geographical patterns play a key role
in determining the popularity or strength of certain products, and therefore, MBA has
been increasingly used to optimize inventory for each store or warehouse.
o Campaigns and promotions: MBA is used not only to determine which products go
together but also which products form the keystones of a product line.
o Recommendations: OTT platforms like Netflix and Amazon Prime benefit from MBA
by understanding what kind of movies people tend to watch frequently.
Frequent Item sets
Let I = {I1, I2, . . . , Im} be a set of items. Let D, the task-relevant data, be a set of database
transactions where each transaction T is a nonempty set of items such that T ⊆ I. Each transaction is
associated with an identifier, called TID. Let A be a set of items. A transaction T is said to contain A if
A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅,
and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the
percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or say, both A and
B). This is taken to be the probability, P(A ∪ B). The rule A ⇒ B has confidence c in the transaction
set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to
be the conditional probability, P(B|A). That is,
confidence(A⇒B) = P(B|A).
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called strong. By convention, we write support and confidence values so as
to occur between 0% and 100%, rather than 0 to 1.0. A set of items is referred to as an item set. An
item set that contains k items is a k-item set. The set {computer, antivirus software} is a 2-itemset.
The occurrence frequency of an item set is the number of transactions that contain the item set. This
is also known, simply, as the frequency, support count, or count of the item set. Note that the item
set support defined in Equation is sometimes referred to as relative support, whereas the
occurrence frequency is called the absolute support. If the relative support of an item set I satisfies a
prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding
minimum support count threshold), then I is a frequent item set. The set of frequent k-item sets is
commonly denoted by Lk.
Confidence (A⇒B) = P(B|A) = support(A ∪ B) /support(A) = support count(A ∪ B)/ support count(A) .
Equation shows that the confidence of rule A⇒B can be easily derived from the support counts of A
and A ∪ B. That is, once the support counts of A, B, and A ∪ B are found, it is straightforward to
derive the corresponding association rules A⇒B and B⇒A and check whether they are strong. Thus
the problem of mining association rules can be reduced to that of mining frequent item sets.
1. Find all frequent item sets: By definition, each of these item sets will occur at least as frequently
as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent item sets: By definition, these rules must
satisfy minimum support and minimum confidence.
Additional interestingness measures can be applied for the discovery of correlation relationships
between associated items. Because the second step is much less costly than the first, the overall
performance of mining association rules is determined by the first step.
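The two-step process can be sketched with a brute-force miner over a toy transaction database (a minimal sketch with made-up transactions and thresholds; real miners such as Apriori avoid enumerating every subset):

```python
from itertools import combinations

# Toy transaction database (made-up items for illustration).
D = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]
min_sup = 3      # minimum support count
min_conf = 0.7   # minimum confidence

items = set().union(*D)

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

# Step 1: find all frequent item sets (brute force over every subset).
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(sorted(items), k):
        c = support_count(set(cand))
        if c >= min_sup:
            frequent[frozenset(cand)] = c

# Step 2: generate strong rules A => B from each frequent item set,
# using confidence(A => B) = support_count(A u B) / support_count(A).
rules = []
for itemset, count in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for A in map(frozenset, combinations(itemset, r)):
            conf = count / frequent[A]   # every subset of a frequent set is frequent
            if conf >= min_conf:
                rules.append((set(A), set(itemset - A), conf))

for A, B, conf in rules:
    print(f"{A} => {B} (conf={conf:.2f})")
```

With these thresholds, only {bread, butter} is frequent beyond the singletons, yielding the two strong rules bread ⇒ butter and butter ⇒ bread.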
A major challenge in mining frequent item sets from a large data set is the fact that such mining
often generates a huge number of item sets satisfying the minimum support (min sup) threshold,
especially when min sup is set low. This is because if an item set is frequent, each of its subsets is
frequent as well. A long item set will contain a combinatorial number of shorter, frequent sub-item
sets. For example, a frequent item set of length 100, such as {a1, a2, . . . , a100}, contains C(100, 1) =
100 frequent 1-itemsets: {a1}, {a2}, . . . , {a100}; C(100, 2) = 4950 frequent 2-itemsets: {a1, a2}, {a1, a3}, . . . ,
{a99, a100}; and so on. The total number of frequent item sets that it contains is thus
C(100, 1) + C(100, 2) + · · · + C(100, 100) = 2^100 − 1.
This is too huge a number of item sets for any computer to compute or store. To overcome this
difficulty, we introduce the concepts of closed frequent item set and maximal frequent item set.
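The count can be checked numerically (a quick sketch using Python's standard-library math.comb):

```python
import math

# Number of frequent sub-item sets of a frequent 100-item set:
# C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 - 1.
n = 100
total = sum(math.comb(n, k) for k in range(1, n + 1))
print(total)   # about 1.27 * 10^30 item sets
```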
An item set X is closed in a data set D if there exists no proper super item set Y such that Y has the
same support count as X in D. An item set X is a closed frequent item set in set D if X is both closed
and frequent in D.
An item set X is a maximal frequent item set (or max-item set) in a data set D if X is frequent, and
there exists no super-item set Y such that X ⊂ Y and Y is frequent in D.
Let C be the set of closed frequent item sets for a data set D satisfying a minimum support threshold,
min sup. Let M be the set of maximal frequent item sets for D satisfying min sup. Suppose that we
have the support count of each item set in C and M. Notice that C and its count information can be
used to derive the whole set of frequent item sets. Thus we say that C contains complete
information regarding its corresponding frequent item sets. On the other hand, M registers only the
support of the maximal item sets. It usually does not contain the complete support information
regarding its corresponding frequent item sets. We illustrate these concepts with the following
example.
Closed and maximal frequent item sets. Suppose that a transaction database has only two
transactions: {⟨a1, a2, . . . , a100⟩; ⟨a1, a2, . . . , a50⟩}. Let the minimum support count threshold be
min sup = 1. We find two closed frequent item sets and their support counts, that is, C = {{a1, a2, . . .
, a100} : 1; {a1, a2, . . . , a50} : 2}. There is only one maximal frequent item set: M = {{a1, a2, . . . ,
a100} : 1}. Notice that we cannot include {a1, a2, . . . , a50} as a maximal frequent item set because it
has a frequent superset, {a1, a2, . . . , a100}. Compare this to the above, where we determined that
there are 2^100 − 1 frequent item sets, which is too huge a set to be enumerated!
The set of closed frequent item sets contains complete information regarding the frequent item sets.
For example, from C, we can derive, say, (1) {a2, a45 : 2} since {a2, a45} is a sub-item set of the item
set {a1, a2, . . . , a50 : 2}; and (2) {a8, a55 : 1} since {a8, a55} is not a sub-item set of the previous
item set but of the item set {a1, a2, . . . , a100 : 1}. However, from the maximal frequent item set, we
can only assert that both item sets ({a2, a45} and {a8, a55}) are frequent, but we cannot assert their
actual support counts.
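The same distinction can be checked programmatically on a scaled-down version of the example (a sketch; a1..a100 and a1..a50 are shrunk to a1..a5 and a1..a3 so the power set stays enumerable):

```python
from itertools import combinations

# Scaled-down analogue of the example: two transactions, min_sup = 1.
D = [frozenset(f"a{i}" for i in range(1, 6)),   # {a1, ..., a5}
     frozenset(f"a{i}" for i in range(1, 4))]   # {a1, ..., a3}
min_sup = 1

items = set().union(*D)
support = {}
for k in range(1, len(items) + 1):
    for cand in combinations(sorted(items), k):
        s = frozenset(cand)
        c = sum(1 for t in D if s <= t)
        if c >= min_sup:
            support[s] = c

# Closed: no proper superset has the same support count.
closed = {s for s in support
          if not any(s < t and support[t] == support[s] for t in support)}
# Maximal: no proper superset is frequent at all.
maximal = {s for s in support if not any(s < t for t in support)}

print(len(support), len(closed), len(maximal))
```

As in the full-size example, there are exponentially many frequent item sets (31 here), but only two closed ones, {a1..a5}:1 and {a1..a3}:2, and a single maximal one, {a1..a5}.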
Association Rule
Association rule learning is a type of unsupervised learning technique that checks for
the dependency of one data item on another data item and maps them accordingly so
that the relationship can be used profitably. It tries to find interesting relations or
associations among the variables of a dataset, based on different rules for discovering
interesting relations between variables in the database.
Association rule learning is one of the important concepts of machine learning, and it
is employed in market basket analysis, web usage mining, continuous production, etc.
Here, market basket analysis is a technique used by various big retailers to discover
the associations between items. We can understand it by taking the example of a
supermarket, where products that are frequently purchased together are placed
together.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so
these products are stored on the same shelf or mostly nearby.
Association rule learning can be implemented with the following popular algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
Association rules are evaluated with the following metrics:
o Support
o Confidence
o Lift
Support
Support is the frequency of an item, i.e., how frequently an itemset appears in the
dataset. It is defined as the fraction of the transactions that contain the itemset X:
Support(X) = Freq(X) / T
where Freq(X) is the number of transactions containing X and T is the total number of
transactions.
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the
items X and Y occur together in the dataset when the occurrence of X is already
given. It is the ratio of the number of transactions that contain both X and Y to the
number of transactions that contain X:
Confidence(X ⇒ Y) = Freq(X ∪ Y) / Freq(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the
support expected if X and Y were independent of each other:
Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible ranges of values:
o Lift = 1: X and Y are independent, and the rule carries no information.
o Lift > 1: X and Y are positively correlated and likely to be bought together.
o Lift < 1: X and Y are negatively correlated; buying one makes buying the other less
likely.
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to
work on databases that contain transactions. It uses a breadth-first search and a hash
tree to count candidate itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products
that can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
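The level-wise, breadth-first idea can be sketched as follows (a minimal sketch with made-up transactions; production use would typically rely on a library such as mlxtend or R's arules):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Candidate generation: join frequent (k-1)-item sets, then prune
        # any candidate with an infrequent (k-1)-subset (Apriori property).
        prev = list(Lk)
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan the database to count each surviving candidate.
        Lk = {}
        for c in cands:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_sup:
                Lk[c] = n
        frequent.update(Lk)
        k += 1
    return frequent

freq = apriori([{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "butter"}], min_sup=2)
```

Each pass extends the frequent item sets by one item and rescans the database once, which is exactly what the breadth-first strategy above describes.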
Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first
search technique to find frequent itemsets in a transaction database, and it generally
executes faster than the Apriori algorithm.
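Eclat's depth-first search works on a vertical data layout: each item maps to the set of transaction IDs (TIDs) containing it, and candidates are extended by intersecting TID sets. A sketch with made-up data:

```python
def eclat(prefix, items, min_sup, out):
    """Depth-first search over (item, tidset) pairs."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_sup:
            out[frozenset(prefix | {item})] = len(tids)
            # Extend the current prefix: intersecting TID sets gives the
            # support of the larger item set without rescanning the database.
            suffix = [(i, tids & t) for i, t in items]
            eclat(prefix | {item}, suffix, min_sup, out)
    return out

# Vertical layout: item -> set of transaction IDs containing it (made-up).
vertical = {
    "milk":   {1, 2, 4},
    "bread":  {1, 2, 3},
    "butter": {2, 3, 4},
}
freq = eclat(set(), list(vertical.items()), min_sup=2, out={})
```

The support of any extension is just the size of an intersection of TID sets, which is why Eclat avoids the repeated database scans that Apriori performs.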
“How can we further improve the efficiency of Apriori-based mining?” Many variations of the Apriori
algorithm have been proposed that focus on improving the efficiency of the original algorithm.
Several of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can
be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning
each transaction in the database to generate the frequent 1-itemsets, we can also generate all of the
2-itemsets for each transaction, hash them into the different buckets of a hash table structure, and
increase the corresponding bucket counts. A 2-itemset whose corresponding bucket count is below
the support threshold cannot be frequent and can therefore be removed from the candidate set.
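The bucket-counting idea can be sketched as follows (made-up transactions and a toy modulo hash; real implementations tune the table size). Because a bucket count is an upper bound on the counts of the pairs hashed into it, the filter can discard candidates but never a truly frequent pair:

```python
from itertools import combinations

# Made-up transaction database.
transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}]
min_sup = 2
n_buckets = 7

# While scanning for 1-itemsets, also hash every 2-itemset of each
# transaction into a bucket and bump that bucket's count.
buckets = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

def may_be_frequent(pair):
    # A 2-itemset whose bucket count is below min_sup cannot be
    # frequent, so it is pruned from the candidate set C2.
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup
```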
FP-growth
The first scan of the database is the same as Apriori, which derives the set of frequent items (1-
itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of
frequent items is sorted in the order of descending support count. This resulting set or list is denoted
by L. Thus, we have L ={{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
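Computing L is a single counting pass. A sketch, assuming the nine-transaction database (T100..T900) commonly used with this example, which yields exactly these counts; ties among equal counts are broken arbitrarily, here by item name:

```python
from collections import Counter

# The nine transactions commonly used with this example (T100..T900).
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
     ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
     ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
min_sup = 2

# First scan: count the support of each item.
counts = Counter(item for t in D for item in t)

# L: frequent items sorted by descending support count
# (ties broken by item name for a deterministic order).
L = sorted(((i, c) for i, c in counts.items() if c >= min_sup),
           key=lambda x: (-x[1], x[0]))
print(L)   # [('I2', 7), ('I1', 6), ('I3', 6), ('I4', 2), ('I5', 2)]
```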
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan
database D a second time. The items in each transaction are processed in L order (i.e., sorted
according to descending support count), and a branch is created for each transaction. For example,
the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order),
leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩,
where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1. The second transaction,
T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the
root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing
path for T100. Therefore, we instead increment the count of the I2 node by 1, and create a new
node, ⟨I4: 1⟩, which is linked as a child to ⟨I2: 2⟩. In general, when considering the branch to be added