
Amar Seva Mandal

GOVINDRAO WANJARI COLLEGE OF ENGINEERING & TECHNOLOGY, NAGPUR
B.Tech. Seventh Semester (Information Technology) (C.B.C.S.)
Data Warehousing & Mining (BTIT701T)
Prof. Swapnil R. Dikshit

Market basket analysis:

This process analyzes customer buying habits by finding associations between the
different items that customers place in their shopping baskets. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if
customers buy milk, how likely are they also to buy bread (and what kind of bread)
on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers do selective marketing and plan their shelf space.
Example:

If customers who purchase computers also tend to buy antivirus software at the same
time, then placing the hardware display close to the software display may help increase
the sales of both items. In an alternative strategy, placing hardware and software at
opposite ends of the store may entice customers who purchase such items to pick up
other items along the way. For instance, after deciding on an expensive computer, a
customer may observe security systems for sale while heading toward the software
display to purchase antivirus software and may decide to purchase a home security
system as well. Market basket analysis can also help retailers plan which items to put
on sale at reduced prices. If customers tend to purchase computers and printers
together, then having a sale on printers may encourage the sale of printers as well as
computers.
Market Basket Analysis in Data Mining
Market basket analysis is a data mining technique used by retailers to increase sales
by better understanding customer purchasing patterns. It involves analyzing large
data sets, such as purchase history, to reveal product groupings and products that
are likely to be purchased together.

The adoption of market basket analysis was aided by the advent of electronic point-
of-sale (POS) systems. Compared to handwritten records kept by store owners, the
digital records generated by POS systems made it easier for applications to process
and analyze large volumes of purchase data.

Implementation of market basket analysis requires a background in statistics and


data science and some algorithmic computer programming skills. For those without
the needed technical skills, commercial, off-the-shelf tools exist.

One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes
transaction data contained in a spreadsheet and performs market basket analysis. A
transaction ID must relate to the items to be analyzed. The Shopping Basket Analysis
tool then creates two worksheets:

o The Shopping Basket Item Groups worksheet, which lists items that are frequently
purchased together.
o The Shopping Basket Rules worksheet, which shows how items are related (for
example, purchasers of Product A are likely to buy Product B).

How does Market Basket Analysis Work?


Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {}
construct. For example, IF a customer buys bread, THEN he is likely to buy butter as
well.

Association rules are usually represented as: {Bread} -> {Butter}

Some terms to familiarize yourself with in Market Basket Analysis are:

o Antecedent: Items or 'itemsets' found within the data are antecedents. In simpler
words, it's the IF component, written on the left-hand side. In the above example,
bread is the antecedent.
o Consequent: A consequent is an item or set of items found in combination with the
antecedent. It's the THEN component, written on the right-hand side. In the above
example, butter is the consequent.

Types of Market Basket Analysis


Market Basket Analysis techniques can be categorized based on how the available
data is utilized. The main types of market basket analysis in data mining are the
following:

1. Descriptive market basket analysis: This type only derives insights from past data
and is the most frequently used approach. The analysis here does not make any
predictions but rates the association between products using statistical techniques.
For those familiar with the basics of Data Analysis, this type of modelling is known as
unsupervised learning.
2. Predictive market basket analysis: This type uses supervised learning models like
classification and regression. It essentially aims to mimic the market to analyze what
causes what to happen. Essentially, it considers items purchased in a sequence to
determine cross-selling. For example, buying an extended warranty is more likely to
follow the purchase of an iPhone. While it isn't as widely used as descriptive MBA, it
is still a very valuable tool for marketers.
3. Differential market basket analysis: This type of analysis is beneficial for competitor
analysis. It compares purchase history between stores, between seasons, between
two time periods, between different days of the week, etc., to find interesting
patterns in consumer behaviour. For example, it can help determine why some users
prefer to purchase the same product at the same price on Amazon vs Flipkart. The
answer can be that the Amazon reseller has more warehouses and can deliver faster,
or maybe something more profound like user experience.

Algorithms associated with Market Basket Analysis


In market basket analysis, association rules are used to predict the likelihood of
products being purchased together. Association rules count the frequency of items
that occur together, seeking to find associations that occur far more often than
expected.

Algorithms that use association rules include AIS, SETM and Apriori. The Apriori
algorithm is the one most commonly cited by data scientists in research articles about
market basket analysis. It identifies frequent individual items in the database and then
extends them to larger and larger itemsets, as long as those itemsets appear
sufficiently often in the database.

R's arules package is an open-source toolkit for association mining using the R
programming language. This package supports the Apriori and Eclat algorithms, and
related packages, including arulesNBMiner, opusminer, RKEEL and RSarules, provide
additional mining algorithms.

With the help of the Apriori algorithm, we can further classify and simplify the item
sets that consumers frequently buy. There are three components in the Apriori
algorithm:

o SUPPORT
o CONFIDENCE
o LIFT

For example, suppose 5000 transactions have been made through a popular e-Commerce
website, and we want to calculate the support, confidence, and lift for two products,
say a pen and a notebook. Out of the 5000 transactions, 1000 contain a pen, 500
contain a notebook, and 200 contain both.

SUPPORT

Support is the fraction of transactions that contain the itemset: the number of
transactions containing the items divided by the total number of transactions.

Support = freq(A, B)/N

support(pen ⇒ notebook) = transactions containing both pen and notebook / total
transactions

i.e., support -> 200/5000 = 4 percent

Likewise, the support of the notebook alone is 500/5000 = 10 percent.

CONFIDENCE

Confidence measures whether the consequent tends to be bought when the
antecedent is bought. It is calculated as the number of combined transactions divided
by the number of transactions for the antecedent.

Confidence = freq(A, B)/freq(A)

Confidence = combined transactions / pen transactions

i.e., confidence -> 200/1000 = 20 percent

LIFT

Lift tells us how much more often the two products sell together than we would
expect if they sold independently. It is the confidence of the rule divided by the
support of the consequent.

Lift = confidence percent / support(consequent) percent

Lift -> 20/10 = 2

When the lift value is below 1, the combination is bought together less often than
expected, so such products are rarely worth promoting together. In this case, a lift
of 2 shows that the probability of buying both items together is twice as high as it
would be if pen and notebook purchases were independent.
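
The same arithmetic is easy to reproduce in a few lines of code. The sketch below recomputes the three metrics from the illustrative counts above (the counts are assumptions of the example, not real sales data):

```python
# Illustrative counts from the pen/notebook example above
total_transactions = 5000
pen_count = 1000       # transactions containing a pen
notebook_count = 500   # transactions containing a notebook
both_count = 200       # transactions containing both items

# Support of the rule {pen} -> {notebook}: freq(A, B) / N
support_rule = both_count / total_transactions          # 0.04 -> 4 percent

# Support of the consequent {notebook}
support_notebook = notebook_count / total_transactions  # 0.10 -> 10 percent

# Confidence: freq(A, B) / freq(A)
confidence = both_count / pen_count                     # 0.20 -> 20 percent

# Lift: confidence divided by the support of the consequent
lift = confidence / support_notebook                    # 2.0

print(f"support={support_rule:.0%} confidence={confidence:.0%} lift={lift:.1f}")
```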

Examples of Market Basket Analysis


The following examples explore Market Basket Analysis by market segment:
o Retail: The best-known MBA case study is Amazon.com. Whenever you view a
product on Amazon, the product page automatically recommends items frequently
bought together. It is perhaps the simplest and cleanest example of MBA's
cross-selling techniques.
Apart from e-commerce formats, MBA is also widely applicable to the in-store retail
segment. Grocery stores pay meticulous attention to product placement and shelf
optimization. For example, you are almost always likely to find shampoo and
conditioner placed very close to each other at the grocery store. Walmart's famous
beer and diapers association anecdote is also an example of Market Basket Analysis.
o Telecom: With the ever-increasing competition in the telecom sector, companies are
paying close attention to the services their customers use. For example, telecom
providers have started to bundle TV and Internet packages with other discounted
online services to reduce churn.
o IBFS: Tracing credit card history is a hugely advantageous MBA opportunity for IBFS
organizations. For example, Citibank frequently employs sales personnel at large
malls to lure potential customers with attractive discounts on the go. They also
associate with apps like Swiggy and Zomato to show customers many offers they can
avail of via purchasing through credit cards. IBFS organizations also use basket
analysis to determine fraudulent claims.
o Medicine: Basket analysis is used to determine comorbid conditions and symptom
analysis in the medical field. It can also help identify which genes or traits are
hereditary and which are associated with local environmental effects.
Benefits of Market Basket Analysis
The market basket analysis data mining technique has the following benefits, such as:

o Increasing market share: Once a company hits peak growth, it becomes challenging
to determine new ways of increasing market share. Market Basket Analysis can be
used to put together demographic and gentrification data to determine the location
of new stores or geo-targeted ads.
o Behaviour analysis: Understanding customer behaviour patterns is a cornerstone of
marketing. MBA can be used anywhere from simple catalogue design to UI/UX.
o Optimization of in-store operations: MBA is helpful in determining not only what
goes on the shelves but also what goes on behind the store. Geographical patterns
play a key role in determining the popularity or strength of certain products, and
therefore, MBA has been increasingly used to optimize inventory for each store or
warehouse.
o Campaigns and promotions: MBA is used not only to determine which products go
together but also to identify which products form the keystones of a product line.
o Recommendations: OTT platforms like Netflix and Amazon Prime benefit from MBA
by understanding what kind of movies people tend to watch frequently.
Frequent Item sets
Let I = {I1, I2, . . . , Im} be a set of items. Let D, the task-relevant data, be a set of database
transactions where each transaction T is a nonempty set of items such that T ⊆ I. Each transaction is
associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A
if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅,
and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the
percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, say, both A and
B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction
set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to
be the conditional probability P(B|A). That is,

support (A⇒B) = P(A ∪ B)

confidence(A⇒B) = P(B|A).

Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called strong. By convention, we write support and confidence values so as
to occur between 0% and 100%, rather than 0 to 1.0. A set of items is referred to as an item set. An
item set that contains k items is a k-item set. The set {computer, antivirus software} is a 2-itemset.
The occurrence frequency of an item set is the number of transactions that contain the item set. This
is also known, simply, as the frequency, support count, or count of the item set. Note that the item
set support defined above is sometimes referred to as relative support, whereas the occurrence
frequency is called the absolute support. If the relative support of an item set I satisfies a
prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding
minimum support count threshold), then I is a frequent item set. The set of frequent k-item sets is
commonly denoted by Lk.

From the definitions above, we have

confidence(A⇒B) = P(B|A) = support(A ∪ B) / support(A) = support count(A ∪ B) / support count(A).

This equation shows that the confidence of rule A⇒B can be easily derived from the support
counts of A and A ∪ B. That is, once the support counts of A, B, and A ∪ B are found, it is
straightforward to derive the corresponding association rules A⇒B and B⇒A and check whether
they are strong. Thus the problem of mining association rules can be reduced to that of mining
frequent item sets.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent item sets: By definition, each of these item sets will occur at least as frequently
as a predetermined minimum support count, min sup.

2. Generate strong association rules from the frequent item sets: By definition, these rules must
satisfy minimum support and minimum confidence.

Additional interestingness measures can be applied for the discovery of correlation relationships
between associated items. Because the second step is much less costly than the first, the overall
performance of mining association rules is determined by the first step.
A major challenge in mining frequent item sets from a large data set is the fact that such mining
often generates a huge number of item sets satisfying the minimum support (min sup) threshold,
especially when min sup is set low. This is because if an item set is frequent, each of its subsets is
frequent as well. A long item set will contain a combinatorial number of shorter, frequent sub-item
sets. For example, a frequent item set of length 100, such as {a1, a2, . . . , a100}, contains
C(100, 1) = 100 frequent 1-itemsets: {a1}, {a2}, . . . , {a100}; C(100, 2) frequent 2-itemsets:
{a1, a2}, {a1, a3}, . . . , {a99, a100}; and so on. The total number of frequent item sets that it
contains is thus

C(100, 1) + C(100, 2) + · · · + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30.
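
The arithmetic is easy to check with Python's standard math.comb; it is enumerating and storing the item sets themselves that is infeasible:

```python
import math

# Every nonempty subset of a frequent 100-itemset is itself frequent,
# so the count is the sum of C(100, k) for k = 1..100.
total = sum(math.comb(100, k) for k in range(1, 101))

assert total == 2**100 - 1
print(f"{total:.3e}")  # about 1.268e+30
```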

This is too huge a number of item sets for any computer to compute or store. To overcome this
difficulty, we introduce the concepts of closed frequent item set and maximal frequent item set.

An item set X is closed in a data set D if there exists no proper super item set Y such that Y has the
same support count as X in D. An item set X is a closed frequent item set in set D if X is both closed
and frequent in D.

An item set X is a maximal frequent item set (or max-item set) in a data set D if X is frequent, and
there exists no super-item set Y such that X ⊂ Y and Y is frequent in D.
Let C be the set of closed frequent item sets for a data set D satisfying a minimum support threshold,
min sup. Let M be the set of maximal frequent item sets for D satisfying min sup. Suppose that we
have the support count of each item set in C and M. Notice that C and its count information can be
used to derive the whole set of frequent item sets. Thus we say that C contains complete
information regarding its corresponding frequent item sets. On the other hand, M registers only the
support of the maximal item sets. It usually does not contain the complete support information
regarding its corresponding frequent item sets. We illustrate these concepts with the following
example.

Closed and maximal frequent item sets. Suppose that a transaction database has only two
transactions: {⟨a1, a2, . . . , a100⟩; ⟨a1, a2, . . . , a50⟩}. Let the minimum support count threshold be
min sup = 1. We find two closed frequent item sets and their support counts, that is, C = {{a1, a2, . . .
, a100} : 1; {a1, a2, . . . , a50} : 2}. There is only one maximal frequent item set: M = {{a1, a2, . . . ,
a100} : 1}. Notice that we cannot include {a1, a2, . . . , a50} as a maximal frequent item set because it
has a frequent superset, {a1, a2, . . . , a100}. Compare this to the above, where we determined that
there are 2^100 − 1 frequent item sets, which is too huge a set to be enumerated!

The set of closed frequent item sets contains complete information regarding the frequent item sets.
For example, from C, we can derive, say, (1) {a2, a45 : 2} since {a2, a45} is a sub-item set of the item
set {a1, a2, . . . , a50 : 2}; and (2) {a8, a55 : 1} since {a8, a55} is not a sub-item set of the previous
item set but of the item set {a1, a2, . . . , a100 : 1}. However, from the maximal frequent item set, we
can only assert that both item sets ({a2, a45} and {a8, a55}) are frequent, but we cannot assert their
actual support counts.
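
A brute-force sketch makes the distinction concrete. Enumerating subsets of a 100-item transaction is infeasible, so the toy version below shrinks the example to transactions ⟨a1, ..., a5⟩ and ⟨a1, ..., a3⟩ and derives the closed and maximal frequent item sets directly from the definitions:

```python
from itertools import combinations

# Scaled-down version of the two-transaction example: a1..a5 and a1..a3
transactions = [frozenset(f"a{i}" for i in range(1, 6)),
                frozenset(f"a{i}" for i in range(1, 4))]
min_sup = 1

def support(itemset):
    """Number of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions)

# All frequent itemsets, by brute force over every subset of the items
items = sorted(set().union(*transactions))
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_sup}

# Closed: no proper superset has the same support count
closed = {x: s for x, s in frequent.items()
          if not any(x < y and frequent[y] == s for y in frequent)}

# Maximal: no proper superset is frequent at all
maximal = {x: s for x, s in frequent.items()
           if not any(x < y for y in frequent)}

print(len(frequent))  # 2^5 - 1 = 31 frequent itemsets
print(closed)         # {a1..a5}: 1 and {a1..a3}: 2
print(maximal)        # {a1..a5}: 1
```

Every frequent item set here is a subset of one of the two closed ones, and its support can be read off as the largest count among its closed supersets, which is exactly why C carries complete information.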

Association Rule
Association rule learning is a type of unsupervised learning technique that checks for
the dependency of one data item on another data item and maps the items accordingly
so that the relationship can be exploited profitably. It tries to find interesting relations
or associations among the variables of a dataset using different rule measures.

Association rule learning is one of the important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Here, market basket analysis is a technique used by
various big retailers to discover the associations between items. We can understand it
by taking the example of a supermarket, where all products that are frequently
purchased together are placed together.

For example, if a customer buys bread, he is most likely also to buy butter, eggs, or
milk, so these products are stored on the same shelf or nearby.

Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

We will understand these algorithms in later chapters.


How does Association Rule Learning work?
Association rule learning works on the concept of an IF-THEN statement, such as IF A
THEN B.

Here the IF element is called the antecedent, and the THEN statement is called
the consequent. A relationship of this kind, where we can find an association
between two single items, is known as single cardinality. Association rule learning is
all about creating rules, and as the number of items increases, the cardinality
increases accordingly. So, to measure the associations between thousands of data
items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support
Support is the frequency of A, or how frequently an item appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X:

Support(X) = freq(X) / T

Confidence
Confidence indicates how often the rule has been found to be true, or how often the
items X and Y occur together in the dataset when the occurrence of X is already
given. It is the ratio of the transactions that contain X and Y to the number of
transactions that contain X:

Confidence(X ⇒ Y) = freq(X, Y) / freq(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the
support that would be expected if X and Y were independent of each other:

Lift(X ⇒ Y) = Support(X, Y) / (Support(X) × Support(Y))

It has three possible values:

o If Lift = 1: the occurrence of the antecedent and that of the consequent are
independent of each other.
o If Lift > 1: the two itemsets are dependent on each other; the higher the value, the
stronger the association.
o If Lift < 1: one item is a substitute for the other, which means one item has a
negative effect on the occurrence of the other.
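
As a practical sketch of these metrics end to end, the snippet below runs on a handful of toy baskets using the third-party mlxtend library (an assumed dependency, used here purely for illustration):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Toy baskets echoing the bread -> butter example
baskets = [["bread", "butter", "milk"],
           ["bread", "butter"],
           ["bread", "eggs"],
           ["butter", "milk"],
           ["bread", "butter", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Frequent itemsets at 40% minimum support
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
support = dict(zip(frequent["itemsets"], frequent["support"]))

# Metrics for the rule {bread} -> {butter}, computed from the supports
s_xy = support[frozenset({"bread", "butter"})]  # 0.6
s_x = support[frozenset({"bread"})]             # 0.8
s_y = support[frozenset({"butter"})]            # 0.8
print(f"confidence = {s_xy / s_x:.2f}")         # 0.75
print(f"lift = {s_xy / (s_x * s_y):.2f}")       # 0.94
```

Note that the lift here comes out just below 1, so in these toy baskets bread and butter are very slightly negatively associated despite the high confidence.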

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to
work on databases that contain transactions. It uses a breadth-first search and a Hash
Tree to count candidate itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products
that can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.

Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses a
depth-first search technique to find frequent itemsets in a transaction database. It
generally executes faster than the Apriori algorithm.

F-P Growth Algorithm


F-P growth stands for Frequent Pattern growth, and it is an improved
version of the Apriori algorithm. It represents the database in the form of a tree
structure known as a frequent pattern tree (FP-tree). The purpose of this tree
is to extract the most frequent patterns.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:

o Market Basket Analysis: It is one of the popular examples and applications of


association rule mining. This technique is commonly used by big retailers to
determine the association between items.
o Medical Diagnosis: With the help of association rules, patients can be cured easily,
as it helps in identifying the probability of illness for a particular disease.
o Protein Sequence: Association rules help in determining the synthesis of artificial
proteins.
o It is also used for catalog design, loss-leader analysis, and many other
applications.

Frequent Itemset Mining Methods: The Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994


for mining frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search,
where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that
satisfy minimum support. The resulting set is denoted L1. Next, L1 is used
to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori, consisting of join and prune actions.
Example:

TID List of item IDs


T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.


Steps:
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order
to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set
of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-
itemsets satisfying minimum support. In our example, all of the candidates in C1
satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1
to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2
during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
6. For the generation of the set of candidate 3-itemsets, C3, from the join step we first
get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that the four latter candidates cannot possibly be
frequent, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. The join
yields {{I1, I2, I3, I5}}, but this itemset is pruned because its subset {I2, I3, I5} is not
frequent. Thus C4 = ∅, and the algorithm terminates, having found all of the frequent
itemsets.
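
The steps above condense into a short program. Below is a minimal, illustrative Apriori sketch (not an optimized implementation) run on the nine transactions listed above with min sup = 2:

```python
from itertools import combinations

# The nine transactions from the table above
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_sup = 2

def support_count(candidate):
    """Number of transactions in D that contain the candidate itemset."""
    return sum(candidate <= t for t in D)

# L1: frequent 1-itemsets
items = sorted(set().union(*D))
Lk = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup]

k = 2
while Lk:
    print(f"L{k - 1}:", [sorted(s) for s in Lk])
    # Join step: merge pairs of (k-1)-itemsets into k-itemset candidates
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a surviving candidate must be frequent
    Ck = {c for c in Ck
          if all(frozenset(s) in set(Lk) for s in combinations(c, k - 1))}
    # Scan D and keep the candidates that meet minimum support
    Lk = [c for c in Ck if support_count(c) >= min_sup]
    k += 1
```

Running it prints L1 through L3 and then stops, because C4 is pruned to the empty set, matching step 8 above.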
Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them: for each frequent
itemset l, generate all nonempty proper subsets of l, and for every such subset s,
output the rule s ⇒ (l − s) if its confidence, support count(l) / support count(s),
satisfies the minimum confidence threshold.

Example:

Consider the frequent itemset X = {I1, I2, I5} from the transactional data above. Its
nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting
association rules and their confidence values are:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules are output, because these are the only ones that are strong.
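
This generate-and-test procedure can be sketched in code as well, reusing the transaction list D and the support_count helper from the Apriori sketch above (min_conf = 70% as in the example):

```python
from itertools import combinations

min_conf = 0.7  # 70% minimum confidence threshold

def generate_rules(frequent_itemset):
    """Emit the strong rules s => (l - s) for every nonempty proper subset s."""
    l = frozenset(frequent_itemset)
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), size)):
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                print(f"{sorted(s)} => {sorted(l - s)} (conf={conf:.0%})")

generate_rules({"I1", "I2", "I5"})
# ['I5'] => ['I1', 'I2'] (conf=100%)
# ['I1', 'I5'] => ['I2'] (conf=100%)
# ['I2', 'I5'] => ['I1'] (conf=100%)
```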

Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among


data items at low or primitive levels of abstraction due to the sparsity of
data at those levels.
Strong associations discovered at high levels of abstraction may
represent commonsense knowledge.
Therefore, data mining systems should provide capabilities for mining
association rules at multiple levels of abstraction, with sufficient
flexibility for easy traversal among different abstraction spaces.

Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for
the calculation of frequent itemsets at each concept level, starting at concept
level 1 and working downward in the hierarchy toward the more specific concept
levels, until no more frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Data can be generalized by
replacing low-level concepts within the data by their higher-level concepts, or
ancestors, from a concept hierarchy.
The concept hierarchy here has five levels, referred to as levels 0 to 4, starting
with level 0 at the root node for all (the most general abstraction level).

Here, Level 1 includes computer, software, printer & camera, and computer
accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus
software, and so on.
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
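
As an illustrative sketch, generalizing transactions up a concept hierarchy can be as simple as mapping each item to its ancestor before counting. The hierarchy dictionary below is a made-up fragment of the levels just described:

```python
# Hypothetical fragment of the concept hierarchy (item -> level-1 ancestor)
hierarchy = {
    "IBM desktop computer": "computer",
    "Dell laptop computer": "computer",
    "Microsoft office software": "software",
    "Norton antivirus software": "software",
    "Canon digital camera": "printer & camera",
}

transactions = [
    {"IBM desktop computer", "Norton antivirus software"},
    {"Dell laptop computer", "Microsoft office software"},
    {"IBM desktop computer", "Canon digital camera"},
]

# Generalize each transaction by replacing items with their ancestors,
# then mine frequent itemsets at the higher concept level as usual.
generalized = [{hierarchy[item] for item in t} for t in transactions]
print(generalized)
# {computer, software} now has support 2, even though no pair of the
# original level-3 items occurs more than once.
```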

Improving the Efficiency of Apriori

“How can we further improve the efficiency of Apriori-based mining?” Many variations of the Apriori
algorithm have been proposed that focus on improving the efficiency of the original algorithm.
Several of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can
be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning
each transaction in the database to generate the frequent 1-itemsets, L1, we can generate all of the
2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table
structure, and increase the corresponding bucket counts. A 2-itemset whose corresponding bucket
count is below the support threshold cannot be frequent and thus should be removed from the
candidate set.
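
A rough sketch of the bucket-counting idea follows; the hash function and the number of buckets are arbitrary illustrative choices, and D is again the nine-transaction database from the Apriori sketch above:

```python
from itertools import combinations

min_sup = 2
n_buckets = 7  # deliberately small hash table, purely for illustration

# While scanning each transaction, hash all of its 2-itemsets into buckets
buckets = [0] * n_buckets
for t in D:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

def may_be_frequent(pair):
    """A 2-itemset can only be frequent if its bucket count reaches min_sup."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

# Candidates whose bucket count falls short are pruned from C2 up front;
# collisions can only inflate a bucket, so no frequent itemset is lost.
print(may_be_frequent(("I1", "I2")))  # True: {I1, I2} alone occurs 4 times
```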
FP-growth

The first scan of the database is the same as in Apriori, which derives the set of frequent items (1-
itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of
frequent items is sorted in the order of descending support count. This resulting set or list is denoted
by L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with "null." Scan
database D a second time. The items in each transaction are processed in L order (i.e., sorted
according to descending support count), and a branch is created for each transaction. For example,
the scan of the first transaction, "T100: I1, I2, I5," which contains three items (I2, I1, I5 in L order),
leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩,
where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1. The second
transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is
linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with
the existing path for T100. Therefore, we instead increment the count of the I2 node by 1, and create
a new node, ⟨I4: 1⟩, which is linked as a child to ⟨I2: 2⟩. In general, when considering the branch to
be added for a transaction, the count of each node along the common prefix is incremented by 1,
and nodes for the items following the prefix are created and linked accordingly.

Fig: An FP-tree registers compressed, frequent pattern information.
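
A minimal sketch of this construction (building the tree only, without the mining step) is shown below, again reusing the nine-transaction database D from the Apriori sketch:

```python
from collections import Counter

# First scan: frequent items sorted by descending support count (the list L);
# ties are broken alphabetically here, which reproduces the order above.
counts = Counter(item for t in D for item in t)
L = sorted((i for i, c in counts.items() if c >= 2),
           key=lambda i: (-counts[i], i))  # ['I2', 'I1', 'I3', 'I4', 'I5']

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

root = Node(None, None)  # the root, labeled "null"

# Second scan: insert each transaction in L order, sharing common prefixes
for t in D:
    node = root
    for item in sorted((i for i in t if i in L), key=L.index):
        if item in node.children:
            node.children[item].count += 1          # shared prefix: bump count
        else:
            node.children[item] = Node(item, node)  # start a new branch
        node = node.children[item]

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"<{child.item}: {child.count}>")
        show(child, depth + 1)

show(root)  # prints <I2: 7> with <I1: 4>, <I3: 2>, <I4: 1> beneath it, etc.
```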


Mining Frequent Itemsets Using Vertical Data Format
