AML Unit 3
Association rule mining finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis, one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items that
people frequently buy together.
An association rule takes the form of an If-Then statement: the "If" element is called the
antecedent, and the "Then" element is called the consequent. A relationship in which we can find
an association between two single items is known as single cardinality; as the number of items in
a rule increases, the cardinality increases accordingly. To measure the associations between
thousands of data items, there are several metrics. These metrics are given below:
o Support
o Confidence
o Lift
Support
Support indicates how frequently an itemset appears in the dataset. It is the ratio of the number
of transactions that contain the itemset to the total number of transactions:
Support (X) = Freq (X) / N
Confidence
Confidence indicates how often the rule has been found to be true: how often items X and Y
occur together in the dataset, given that X already occurs. It is the ratio of the number of
transactions that contain both X and Y to the number of transactions that contain X:
Confidence (X -> Y) = Freq (X, Y) / Freq (X)
Lift
It is the strength of any rule, which can be defined as below formula:
It is the ratio of the observed support measure and expected support if X and Y are
independent of each other. It has three possible values:
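To make these metrics concrete, here is a minimal Python sketch; the transactions and item names are made up purely for illustration:
# Toy transaction database (illustrative data only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / N

# Metrics for the rule {bread} -> {butter}
X, Y = {"bread"}, {"butter"}
sup = support(X | Y)                    # Support(X, Y) = Freq(X, Y) / N
conf = sup / support(X)                 # Confidence = Support(X, Y) / Support(X)
lift = sup / (support(X) * support(Y))  # Lift = Support(X, Y) / (Support(X) * Support(Y))
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
# support=0.60, confidence=0.75, lift=0.94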
Association rules can be mined with several algorithms, the most common being:
1. Apriori
2. Eclat
3. FP-Growth Algorithm
Association Rule Learning (ARL) is a rule-based machine learning technique used to find
patterns in data, and the Apriori algorithm is the classic way to carry it out. Apriori is a
basket-analysis method used to reveal product associations.
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed
to work on databases that contain transactions. With the help of these association rules, it
determines how strongly or how weakly two objects are connected. The algorithm was proposed
by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps
to find products that are likely to be bought together.
Step-1: Determine the support of the itemsets in the transactional database, and select the
minimum support and confidence thresholds.
Step-2: Keep all itemsets in the transactions with a support value higher than the minimum
(selected) support value.
Step-3: Find all the rules of these subsets that have a confidence value higher than the
threshold (minimum) confidence, working from the frequent itemsets found in the
previous step. These steps are sketched in code below.
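As a sketch of these steps in code, the third-party mlxtend library implements Apriori directly; the transactions below are invented for illustration:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transaction database (illustrative data only)
transactions = [["milk", "bread"], ["bread", "butter"],
                ["milk", "bread", "butter"], ["bread", "jam"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Steps 1-2: frequent itemsets above the minimum support
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Step 3: rules above the minimum confidence
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])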
The association rule is {X -> Y}, where {X} and {Y} are sets of items. This association
rule means that if all items in {X} are in one basket, {Y} will "likely" be in that basket as well.
{X} is called the antecedent or left-hand side of the association rule (LHS),
{Y} is called the consequent or right-hand side of the association rule (RHS).
An example association rule for grocery items would be {Potato, Onion} -> {Bread}: if
Potatoes and Onions {X} are purchased, customers will most likely also buy Bread {Y}. It
should be noted that the symbol " -> " does not indicate a causal relationship between {X}
and {Y}. This symbol merely expresses an estimate of the conditional probability of {Y} given
{X}.
Figure 3: Formulas for support, confidence and lift for the association rule X ⟹ Y
1. Support (X, Y): the probability of products X and Y being seen together.
Support (X, Y) = Freq (X, Y) / N
Here N is the total number of transactions. Expanding the expression: support is calculated by
dividing the number of transactions containing both X and Y by the total number of
transactions. In comparative models, the higher the support value, the better. The support
value must be greater than the threshold value; itemsets below it are eliminated before the
next step is taken.
Consider a worked example of the Apriori algorithm. We assume that the minimum support
count is 2 and the minimum confidence is 50%.
Step 1: Create a table with the support count of every item present in the transaction
database.
We compare each item's support count with the minimum support count we have set. If an
item's support count is less than the minimum support count, we remove that item.
Step 2: Find all supersets with 2 items from the remaining items. Since I4 was discarded in
the previous step, we do not take any superset containing I4.
Now, remove all itemsets whose support count is less than the minimum support count. The
result is the final dataset of frequent 2-itemsets.
Step 3: Find supersets with 3 items from the itemsets present in the last transaction dataset,
then check whether all subsets of each candidate itemset are frequent, and remove the
candidates with infrequent subsets.
In this case, if we select { I1, I2, I3 } we must have all of its 2-item subsets, that is,
{ I1, I2 }, { I2, I3 }, { I1, I3 }. But we don't have { I1, I3 } in our dataset. The same is true
for { I1, I3, I5 } and { I2, I3, I5 }.
Step 4: Generate association rules from the frequent itemsets:
1. I1 -> I2
2. I2 -> I3
3. I2 -> I5
4. I2 -> I1
5. I3 -> I2
6. I5 -> I2
Since all of these association rules have confidence ≥ 50%, all of them can be considered
strong association rules.
Step 5: We calculate the lift for all the strong association rules.
For example, for the rule I1 -> I2 it means that there is a 25% chance that customers who buy
I1 will also buy I2.
Cosine Similarity
When the distance between two data objects is small, there is a high degree of similarity;
when the distance is large, there is a low degree of similarity. Some of the popular similarity
measures are –
1. Euclidean Distance.
2. Manhattan Distance.
3. Jaccard Similarity.
4. Minkowski Distance.
5. Cosine Similarity.
Cosine similarity is a metric that helps determine how similar two data objects are,
irrespective of their size. For example, we can measure the similarity between two sentences in
Python using cosine similarity. In cosine similarity, data objects in a dataset are treated as
vectors. The formula to find the cosine similarity between two vectors is –
cos(x, y) = x . y / (||x|| ||y||)
where,
x . y = dot product of the vectors 'x' and 'y',
||x|| and ||y|| = lengths (magnitudes) of the two vectors 'x' and 'y',
||x|| ||y|| = product of the magnitudes of the two vectors 'x' and 'y'.
Example: Consider finding the similarity between two vectors 'x' and 'y' using cosine
similarity. The 'x' vector has values x = { 3, 2, 0, 5 } and the 'y' vector has values
y = { 1, 0, 0, 0 }. The formula for calculating the cosine similarity is cos(x, y) = x . y / (||x|| ||y||):
x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √(3² + 2² + 0² + 5²) = √38 ≈ 6.16
||y|| = √(1² + 0² + 0² + 0²) = 1
cos(x, y) = 3 / (6.16 * 1) ≈ 0.49
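The same computation can be checked with a short NumPy sketch:
import numpy as np

x = np.array([3, 2, 0, 5])
y = np.array([1, 0, 0, 0])

# cos(x, y) = x . y / (||x|| ||y||)
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # ~0.4867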
Advantages:
Cosine similarity is beneficial because even if two similar data objects are far apart by
Euclidean distance because of their size, they can still have a small angle between them.
The smaller the angle, the higher the similarity.
When plotted in a multi-dimensional space, cosine similarity captures the orientation (the
angle) of the data objects and not the magnitude.
The Jaccard Similarity Index is a measure of the similarity between two sets of data.
Developed by Paul Jaccard, the index ranges from 0 to 1; the closer to 1, the more similar
the two sets of data. It is calculated as:
Jaccard Similarity = (number of observations in both sets) / (number of observations in either set)
If two datasets share the exact same members, their Jaccard Similarity Index will be 1.
Conversely, if they have no members in common then their similarity will be 0.
The following examples show how to calculate the Jaccard Similarity Index for a few
different datasets.
Example 1: Jaccard Similarity
Suppose we have two sets of observations. To calculate the Jaccard Similarity between them,
we first find the total number of observations in both sets, then divide by the total number
of observations in either set:
Number of observations in both: {0, 2, 5, 9} = 4
Number of observations in either: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} = 10
Jaccard Similarity: 4 / 10 = 0.4
Example 2: Jaccard Similarity
To calculate the Jaccard Similarity between two other sets, we again find the total number of
observations in both sets, then divide by the total number of observations in either set:
Number of observations in both: {} = 0
Number of observations in either: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = 11
Jaccard Similarity: 0 / 11 = 0
The Jaccard Similarity Index turns out to be 0. This indicates that the two datasets share no
common members.
Example 3: Jaccard Similarity for Characters
Note that we can also use the Jaccard Similarity index for datasets that contain characters as
opposed to numbers.
To calculate the Jaccard Similarity between them, we first find the total number of
observations in both sets, then divide by the total number of observations in either set:
Number of observations in both: {‘monkey’} = 1
Number of observations in either: {‘cat’, ‘dog’, ‘hippo’, ‘monkey’, ‘rhino’, ‘ostrich’,
‘salmon’} = 7
Jaccard Similarity: 1 / 7 = 0.142857
The Jaccard Similarity Index turns out to be 0.142857. Since this number is fairly low, it
indicates that the two sets are quite dissimilar.
The Jaccard Distance
The Jaccard distance measures the dissimilarity between two datasets and is calculated as:
Jaccard distance = 1 – Jaccard Similarity
For example, if two datasets have a Jaccard Similarity of 80%, then they would have a Jaccard
distance of 1 – 0.8 = 0.2, or 20%.
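Both measures are easy to compute with Python sets; the two animal sets below are assumptions chosen to be consistent with Example 3 above (one shared member, seven members in total):
def jaccard_similarity(a, b):
    # Jaccard Similarity = |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    # Jaccard distance = 1 - Jaccard Similarity
    return 1 - jaccard_similarity(a, b)

a = ['cat', 'dog', 'hippo', 'monkey']
b = ['monkey', 'rhino', 'ostrich', 'salmon']
print(jaccard_similarity(a, b))  # 0.14285714... (1 / 7)
print(jaccard_distance(a, b))    # 0.85714285...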
Surprise Library:
The "Surprise" library is a Python library designed specifically for developing recommender
systems easily and efficiently. It provides a collection of collaborative filtering algorithms,
which are techniques used to generate recommendations based on user behavior patterns and
preferences. The library's purpose is to make the creation of recommender systems easier by
facilitating the implementation and evaluation of these algorithms. It is a very useful tool for
recommendation systems and can be used in marketing as well as in digital commerce.
For real-world implementations, we need a more extensive library that hides the
implementation details and provides abstract Application Programming Interfaces (APIs) to
build recommender systems. Surprise is a Python library for accomplishing this: it helps you
build and evaluate recommender systems, which suggest items to users (like movies, products,
or books) based on their preferences and behaviors, and it simplifies the process of developing
these systems using collaborative filtering techniques.
Simple Example
Here's a step-by-step example to demonstrate how to use the Surprise library to build a
simple recommender system:
Step 1: Install the Library
pip install scikit-surprise
Step 2: Import the Required Classes
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate
Step 3: Load a Dataset
Load the built-in MovieLens 100k dataset.
data = Dataset.load_builtin('ml-100k')
Step 4: Choose an Algorithm
Choose an algorithm to train your recommender system. Here, we'll use the SVD (Singular
Value Decomposition) algorithm.
algo = SVD()
Step 5: Evaluate the Algorithm
Run 5-fold cross-validation and report the RMSE and MAE accuracy measures.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Output
The output shows the RMSE and MAE for each of the five folds, as well as the mean and
standard deviation across all folds.
Conclusion
The Surprise library makes it easy to develop and evaluate recommender systems using
collaborative filtering techniques. By following these simple steps, you can quickly build a
recommender system and assess its performance using built-in tools and datasets.
SVD (Singular Value Decomposition) is a factorization method that decomposes a matrix into
three other matrices. It is commonly used for dimensionality reduction and noise reduction.
import numpy as np
from sklearn.decomposition import TruncatedSVD
# Sample matrix
X = np.array([[3, 4, 3], [1, 2, 3], [4, 6, 4]])
# Perform SVD
svd = TruncatedSVD(n_components=2) # Reduce to 2 dimensions
X_svd = svd.fit_transform(X)
print("Original Matrix:\n", X)
print("SVD Transformed Matrix:\n", X_svd)
output:
Original Matrix:
[[3 4 3]
[1 2 3]
[4 6 4]]
SVD Transformed Matrix:
[[ 5.82430443 -0.18033773]
[ 3.47734928 1.38118615]
[ 8.23238015 -0.45582502]]
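A common follow-up check is how much of the original variance the two components retain; TruncatedSVD exposes this as explained_variance_ratio_ (a self-contained sketch repeating the fit above):
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[3, 4, 3], [1, 2, 3], [4, 6, 4]])
svd = TruncatedSVD(n_components=2).fit(X)
# Fraction of the variance captured by each of the 2 components
print("Explained variance ratio:", svd.explained_variance_ratio_)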
NMF (Non-negative Matrix Factorization) decomposes a non-negative matrix X into two
non-negative matrices W and H such that X ≈ W × H.
import numpy as np
from sklearn.decomposition import NMF
# Sample matrix
X = np.array([[3, 4, 3], [1, 2, 3], [4, 6, 4]])
# Perform NMF
nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
print("Original Matrix:\n", X)
print("NMF Transformed Matrix (W):\n", W)
print("NMF Components Matrix (H):\n", H)
output:
Original Matrix:
[[3 4 3]
[1 2 3]
[4 6 4]]
NMF Transformed Matrix (W):
[[0.36765554 1.16307885]
[1.40759631 0. ]
[0.35854835 1.73810126]]
NMF Components Matrix (H):
[[0.7207332 1.41216128 2.13355718]
[2.21418567 3.10877869 1.87475062]]
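Because NMF is only an approximate factorization, a quick sanity check is to multiply the factors back together and compare with X (a self-contained sketch repeating the fit above):
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[3, 4, 3], [1, 2, 3], [4, 6, 4]])
nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
# W @ H should approximately reconstruct the original matrix X
print("Reconstruction (W @ H):\n", W @ H)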
PMF (Probabilistic Matrix Factorization) factorizes a ratings matrix into user and item
latent-factor matrices, treating the factors probabilistically with Gaussian priors.
import numpy as np
# NOTE: `pmf` here is assumed to be a local module exposing a PMF class; it is
# not a standard PyPI package, so the fit/predict calls below are an assumed API.
from pmf import PMF
# Sample ratings matrix (0 = missing rating)
R = np.array([[3, 4, 0], [1, 0, 3], [4, 6, 4]])
# Initialize PMF
pmf = PMF(num_factors=2, num_iters=100, learning_rate=0.001, reg=0.01)
pmf.fit(R)                    # assumed method: learn latent factors from R
predictions = pmf.predict()   # assumed method: reconstruct the full matrix
print("Original Matrix:\n", R)
print("Predicted Matrix:\n", predictions)