
Prediction of sales on Market Basket Data

using
Machine Learning Techniques

(Apriori and FP Growth)

Team Members:

16BCE0293 (Vikranth Reddy)


16BCE0176 (Y Abhinav Reddy )
16BCE0822 (Sidhant Hada )

Report submitted for the


Final Project Review of

Course Code: CSE 4022– Natural Language Processing

Slot: C1

Professor: Dr. Arivoli


Abstract:

This project describes how Market Basket Analysis can be applied to an existing data set.
Data mining methods offer many opportunities in the retail sector, where the amount of data
to be handled is growing rapidly. We derive relations between items that reveal highly
specific customer patterns which cannot be uncovered by traditional methods of data
handling; this information is useful in target marketing. We perform Market Basket Analysis
on the data set, determine the association rules between the items in each customer's
transactions at a store, and establish the relations between them. For example, we may find
that a customer who buys yogurt is more likely to also buy whole milk. Market Basket
Analysis is essentially finding patterns in customers' purchases from a store and predicting
the purchase of itemsets (certain combinations of items) from the existing data. We carry out
the analysis with the Apriori algorithm in R Studio to determine customer purchase patterns.
We can also classify customers based on their purchases, which helps store owners order the
quantity of goods suggested by those patterns rather than waste stock; the analysis lets
owners order only the quantities actually required. In this project we also compare the
workings and accuracy of the FP Growth algorithm with the Apriori algorithm on the iris
dataset.

Keywords: Prediction of Customer Patterns, Classification, Market Basket
Analysis, Apriori, FP Growth, iris dataset, Grocery dataset
1. Introduction:

Consider a retail store with a large inventory of various products. We classify these items
into sets that are frequently bought together by customers; analyzing such sales data across
many customers helps reveal a pattern. This project targets a setting where the customers
come from an economically stable, middle-class background with a low to moderate annual
income. It is presumed that most of these customers are regulars who return to the same shop
for their daily household needs. We determine the products that are typically purchased
together. Analysis of such data enables a store owner to arrange the items in an orderly,
well-planned manner and to plan advertising campaigns with the goal of increasing sales.

In this project we use support and confidence, which look beyond the simple frequency with
which two or more items are bought together. We use the popular Apriori algorithm to
discover association rules in these large data sets. Note that our methods are not limited to
discovering patterns in the retail market; they apply more widely to any problem where
individual items can be grouped together to form meaningful data.
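As a toy illustration of these two measures (a hypothetical five-transaction data set, not the Groceries data used later), the support and confidence of a rule such as {yogurt} => {whole milk} can be computed by hand in R:

```r
# Hypothetical mini-dataset of five customer baskets (illustration only).
transactions <- list(
  c("yogurt", "whole milk"),
  c("yogurt", "whole milk", "rolls/buns"),
  c("yogurt", "soda"),
  c("whole milk"),
  c("rolls/buns", "soda")
)
n <- length(transactions)

has_lhs  <- sapply(transactions, function(t) "yogurt" %in% t)
has_both <- sapply(transactions, function(t) all(c("yogurt", "whole milk") %in% t))

support    <- sum(has_both) / n             # 2/5 = 0.4: both items in 40% of baskets
confidence <- sum(has_both) / sum(has_lhs)  # 2/3: of baskets with yogurt, 2 in 3 add milk
```

Apriori keeps only those rules whose support and confidence exceed user-chosen thresholds.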

FP Growth (Frequent Pattern growth) is an improvement over the Apriori algorithm: it finds
frequent itemsets in a transaction database without candidate generation, representing the
frequent items compactly in frequent pattern trees (FP-trees).
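The first pass of FP-tree construction can be sketched in base R (hypothetical transactions; real mining would use an implementation such as the rCBA package applied later): count item frequencies globally, drop infrequent items, and reorder each transaction by descending global frequency so that shared prefixes compress when inserted into the tree.

```r
transactions <- list(
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "bread"),
  c("milk", "soda")
)
min_count <- 2

# Header table: global item counts, frequent items only, in descending order.
freq <- sort(table(unlist(transactions)), decreasing = TRUE)
freq <- freq[freq >= min_count]          # bread=3, milk=3, butter=2 ("soda" dropped)

# Reorder each transaction by the header-table order before tree insertion.
ordered <- lapply(transactions, function(t) {
  t <- t[t %in% names(freq)]
  t[order(match(t, names(freq)))]
})
```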


2. Literature Review Summary Table


Paper 1
Authors and Year: Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda (18 July 2002)
Title: An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data
Concept / Theoretical model: AGM (Apriori-based Graph Mining)
Methodology / Dataset: Mathematical graph isomorphism; graphical representation of the
adjacency matrix with level-wise search (the AGM algorithm). Carcinogenesis data of 300
compounds from Oxford University and the NTP.
Relevant Finding: The proposed algorithm does not show intractable computational complexity
except in cases where the graphs in the database are very large.
Limitations / Research Gaps: Further study of the computational effectiveness of AGM in
relation to its theoretical aspects is left for future work.

Paper 2
Authors and Year: Xintao Wu, Ying Wu, Yongge Wang, Yingjiu Li (2005)
Title: Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse
Frequent Set Mining
Concept / Theoretical model: Iterative proportional fitting; graphical decomposition of the
independence graph derived from frequent itemsets; privacy-aware generation.
Methodology / Dataset: Frequent itemset mining and inverse frequent itemset mining using a
heuristic method; market basket data from the IBM Almaden data generator.
Relevant Finding: A feasible solution to the NP-complete problem of inverse frequent set
mining; confidential itemsets are screened out, preserving patterns without disclosure.
Limitations / Research Gaps: A solution for the entailment variant of the frequent itemset
problem; generating market data when frequency bounds of itemsets are available.

Paper 3
Authors: Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur
Title: Dynamic Itemset Counting and Implication Rules for Market Basket Data
Concept / Theoretical model: A new algorithm (DIC) that uses fewer passes over the data than
the classic Apriori algorithm; a new way of generating implication rules, normalised based
on both antecedent and consequent.
Methodology / Dataset: Apriori and DIC algorithms; synthetic data from the IBM test data
generator and US census data, with an average of 20 items chosen from 1000 items.
Relevant Finding: Test results (finding large itemsets, parallelism, incremental updates) on
both datasets justified DIC and the census-data-based implication rules.
Limitations / Research Gaps: Consideration of pruning techniques for removing irrelevant
data; developing overall measures of difficulty for market-basket datasets.

Paper 4
Authors: Preeti Paranjape-Voditel, Umesh Deshpande
Title: A stock market portfolio recommender system based on association rule mining
Concept / Theoretical model: Time-lagged Association Rule Mining (ARM).
Methodology / Dataset: Evaluating and rebalancing a portfolio; datasets from the BSE-30.
Relevant Finding: Portfolio formulation from the BSE-30; fluctuations in different markets
are found using the system.
Limitations / Research Gaps: Extension to intraday trading datasets using stream mining.

Paper 5
Authors: Yen-Liang Chen, Kwei Tang, Ren-Jie Shen, Ya-Han Hu
Title: Market basket analysis in a multiple store environment
Concept / Theoretical model: Discovering important customer purchase patterns in a
multi-store environment; store-chain association rules.
Methodology / Dataset: Apriori algorithm; synthetic transactional data sets created with the
data generation algorithm proposed by Agrawal and Srikant.
Relevant Finding: The effect of the product replacement rate is very similar to that of the
number of periods and stores, and both factors are stronger than the variation in store size.
Limitations / Research Gaps: Exploring strategies for generating store-chain association
rules incrementally, in an on-line model, in a distributed environment, or in parallel
models.

3. Objective

Perform Market Basket Analysis on the data set, determine the association rules between the
items in each customer's transactions at a store, and establish the relations between them.

We may find, for example, that a customer who buys yogurt is more likely to also buy
whole milk.

Market Basket Analysis is essentially finding patterns in customers' purchases from a store
and predicting the purchase of itemsets (certain combinations of items) from the existing data.
We perform the analysis with both the Apriori algorithm and the FP Growth algorithm in
R Studio, determine the customer purchase patterns, and conclude which method is better.

We can also classify customers based on their purchases, which helps store owners calculate
the quantity of certain goods to order according to those purchase patterns rather than
waste stock.

This analysis helps owners by letting them order only the quantity of items actually
required.
4. Innovation component in the project:

The project helps the shopkeeper analyse sales and predict the optimal amount of grocery
items needed in the future. This reduces wastage and has very high future scope.

It also helps the shopkeeper determine which machine learning technique to implement in
order to get the best result.

It also gives the shopkeeper the option of generating association rules either for a
specific item or generically.

5. Proposed work and implementation

Methodology adopted:

Step 1: Take a data set containing the transactions of various customers of a store,
together with the Iris dataset.

Step 2: Apply the machine learning algorithms (Apriori and FP Growth) to the data
in R Studio.

Step 3: Obtain the association rules for the items from the results of the algorithms.

Step 4: Using the association rules, visually represent the rules, indicating the
probability that each specific item will be required.
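The four steps above can be sketched end-to-end on the bundled Groceries data (a minimal sketch; the parameter values are illustrative and match the ones used in the transcript below):

```r
library(arules)      # association rule mining
library(arulesViz)   # rule visualization

data("Groceries")                                      # Step 1: load transactions
rules <- apriori(Groceries,                            # Step 2: mine rules with Apriori
                 parameter = list(supp = 0.001, conf = 0.8))
rules <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(rules[1:5])                                    # Step 3: examine the top rules
plot(rules[1:20], method = "graph")                    # Step 4: visualize them
```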

Software requirements:

R STUDIO
Code

Importing the data set into R Studio:

> rules <- read.transactions("C:/Users/Y ABHINAV REDDY/Desktop/groceries.csv", sep=",")

> rules

transactions in sparse format with

9835 transactions (rows) and

169 items (columns)

Knowing about the data set :

> summary(rules)

transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt
            2513             1903             1809             1715             1372
         (Other)
           34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
  17   18   19   20   21   22   23   24   26   27   28   29   32
  29   14   14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
            labels
1 abrasive cleaner
2 artif. sweetener
3   baby cosmetics

Viewing the first 3 transactions:

> inspect(rules[1:3])

items

[1] {citrus fruit, margarine, ready soups, semi-finished bread}

[2] {coffee, tropical fruit, yogurt}

[3] {whole milk}

Step 2: Finding the frequency (occurrence) of items across all transactions, which can
be done with:

> itemFrequency(rules[,1])

abrasive cleaner

0.003558719

Above we obtained the proportion of transactions in which abrasive cleaner occurs,
i.e. 0.00355. Since there are 9835 transactions in total, the frequency of the item is:

> 0.00355*9835

[1] 34.91425 (approximately 35 times)
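Instead of multiplying by hand, arules can report absolute counts directly via the type argument of itemFrequency (a sketch of the same check, assuming the transactions loaded above):

```r
# Absolute transaction count instead of relative support for the first item column:
itemFrequency(rules[, 1], type = "absolute")
# abrasive cleaner: ~35, matching the manual calculation above
```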

Plotting the item frequency:

> itemFrequency(rules[,1:6])
abrasive cleaner artif. sweetener   baby cosmetics        baby food             bags
    0.0035587189     0.0032536858     0.0006100661     0.0001016777     0.0004067107
   baking powder
    0.0176919166

> itemFrequencyPlot(rules[,1:6])

> itemFrequencyPlot(rules,support=0.10)

> rules <- apriori(data=Groceries, parameter=list(supp=0.001, conf=0.15, minlen=2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.15    0.1    1 none FALSE            TRUE       5   0.001      2     10  rules FALSE

Algorithmic control:

filter tree heap memopt load sort verbose

0.1 TRUE TRUE FALSE TRUE 2 TRUE

Absolute minimum support count: 9


set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.01s].
writing ... [26820 rule(s)] done [0.02s].
creating S4 object ... done [0.01s].

> rules

set of 26820 rules

# With these parameter settings we generated a set of 26820 rules. That is far too
# many, so we tighten the support and confidence thresholds:

> rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules FALSE

Algorithmic control:

filter tree heap memopt load sort verbose

0.1 TRUE TRUE FALSE TRUE 2 TRUE

Absolute minimum support count: 9

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.01s].
writing ... [410 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

> rules

set of 410 rules


> inspect(rules[1:5])  # Inspecting the first 5 rules.

    lhs                        rhs             support     confidence lift      count
[1] {liquor,red/blush wine} => {bottled beer}  0.001931876 0.9047619  11.235269    19
[2] {curd,cereals}          => {whole milk}    0.001016777 0.9090909   3.557863    10
[3] {yogurt,cereals}        => {whole milk}    0.001728521 0.8095238   3.168192    17
[4] {butter,jam}            => {whole milk}    0.001016777 0.8333333   3.261374    10
[5] {soups,bottled beer}    => {whole milk}    0.001118454 0.9166667   3.587512    11

Now we sort the rules by decreasing confidence:

> rules<-sort(rules, by="confidence", decreasing=TRUE)

> inspect(rules[1:5])

    lhs                                           rhs          support     confidence lift     count
[1] {rice,sugar}                               => {whole milk} 0.001220132          1 3.913649    12
[2] {canned fish,hygiene articles}             => {whole milk} 0.001118454          1 3.913649    11
[3] {root vegetables,butter,rice}              => {whole milk} 0.001016777          1 3.913649    10
[4] {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521          1 3.913649    17
[5] {butter,soft cheese,domestic eggs}         => {whole milk} 0.001016777          1 3.913649    10

The same association rule may appear in redundant form; to remove redundant rules we
perform the following operations:

> subset.matrix <- is.subset(rules, rules)
> subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
> redundant <- colSums(subset.matrix, na.rm=T) >= 1
> rules.pruned <- rules[!redundant]
> rules <- rules.pruned

If we want to know the association rules involving a specific item, this step finds all
the possible rules for that item. For example, to find all association rules with
whole milk on the rhs we perform the following:

> rules <- apriori(data=Groceries, parameter=list(supp=0.001, conf=0.08),
+                  appearance=list(default="lhs", rhs="whole milk"),
+                  control=list(verbose=F))

> rules<-sort(rules, decreasing=TRUE,by="confidence")

> inspect(rules[1:5])
    lhs                                           rhs          support     confidence lift     count
[1] {rice,sugar}                               => {whole milk} 0.001220132          1 3.913649    12
[2] {canned fish,hygiene articles}             => {whole milk} 0.001118454          1 3.913649    11
[3] {root vegetables,butter,rice}              => {whole milk} 0.001016777          1 3.913649    10
[4] {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521          1 3.913649    17
[5] {butter,soft cheese,domestic eggs}         => {whole milk} 0.001016777          1 3.913649    10

Example 2: If we need all the association rules with yogurt on the lhs:

> rules <- apriori(data=Groceries, parameter=list(supp=0.001, conf=0.15, minlen=2),
+                  appearance=list(default="rhs", lhs="yogurt"),
+                  control=list(verbose=F))

> rules<-sort(rules, decreasing=TRUE,by="confidence")

> inspect(rules[1:5])

lhs rhs support confidence lift count

[1] {yogurt} => {whole milk} 0.05602440 0.4016035 1.571735 551

[2] {yogurt} => {other vegetables} 0.04341637 0.3112245 1.608457 427

[3] {yogurt} => {rolls/buns} 0.03436706 0.2463557 1.339363 338


[4] {yogurt} => {tropical fruit} 0.02928317 0.2099125 2.000475 288

[5] {yogurt} => {soda} 0.02735130 0.1960641 1.124368 269
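As a sanity check, the lift reported for rule [1] can be recomputed from figures already shown: whole milk appears in 2513 of the 9835 transactions, and lift = confidence / support(rhs).

```r
conf_rule <- 0.4016035        # confidence of {yogurt} => {whole milk}
supp_rhs  <- 2513 / 9835      # support of {whole milk} from the summary output
lift      <- conf_rule / supp_rhs
lift                          # ~1.5717, matching the reported value
```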

Classification based on customers' purchase patterns can be visualized to reveal customer
trends. Example 1: if a customer is buying yogurt, classification lets us tell which
product he/she is likely to buy next.

> library(arulesViz)

> plot(rules,method="graph",interactive=TRUE,shading=NA)

FP Growth on the iris data set:

library("rCBA")

data("iris")

train <- sapply(iris,as.factor)

train <- data.frame(train, check.names=FALSE)

txns <- as(train,"transactions")

rules <- rCBA::fpgrowth(txns, support=0.03, confidence=0.03, maxLength=2,
                        consequent="Species", parallel=FALSE)

inspect(rules[1:6])

predictions <- rCBA::classification(train,rules)

table(predictions)

sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)
prunedRules <- rCBA::pruning(train, rules, method="m2cba", parallel=FALSE)
predictions <- rCBA::classification(train, prunedRules)

table(predictions)

sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)

plot(rules[1:15],method="graph",interactive=TRUE,shading=NA)

Apriori (on the iris data set):

library(arules)

library(arulesViz)

library(datasets)

rules <- apriori(data=iris, parameter=list(supp=0.001, conf=0.08),
                 appearance=list(default="lhs", rhs="Species=setosa"),
                 control=list(verbose=F))

plot(rules[1:10],method="graph",interactive=TRUE,shading=NA)

Screenshots and Demo

1. Apriori on Grocery Dataset (Market Basket Analysis)


2. FP Growth VS Apriori on Iris Dataset
6. Dataset used:

The data set we use is the default Groceries set (containing approximately 9835 customer
transactions), which is bundled with the "arules" package in R Studio.

We also use the Iris dataset, taken from the UCI Repository online. It has 150 instances
and 4 attributes.

7. Results and Discussion

Classification of customers.
Prediction of the approximate sales of certain grocery items.
Visual representation of the association rules among grocery items.
Visual representation of the association rules formed from the Iris dataset using both
the Apriori and FP Growth algorithms.
Comparison of the Apriori and FP Growth algorithms to determine which is more optimal.
FP Growth was found to be better than Apriori: it is faster, requires no candidate
generation, and makes only two passes over the dataset.
FP Growth is also more specific than Apriori in determining association rules.
Hence, it was concluded that FP Growth is the more optimal solution for predicting
association rules.
Thus, after visualizing the graphical representation of each item in the list of items
sold, we are able to find the patterns of the customers in the store.

8. References:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.9796&rep=rep1&type=pdf
Ngai, E. W., Xiu, L., & Chau, D. C. (2009). Application of data mining techniques in
customer relationship management: A literature review and classification. Expert Systems
with Applications, 36(2), 2592-2602.
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
https://rpubs.com/emzak208/281776
https://epubs.siam.org/doi/abs/10.1137/1.9781611972757.10
https://link.springer.com/content/pdf/10.1007%2F3-540-45372-5_2.pdf
https://www.sciencedirect.com/science/article/pii/S0167923604000685
Borgelt, C. (2005, August). An Implementation of the FP-growth Algorithm. In Proceedings
of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining
Implementations (pp. 1-5). ACM.
