Cluster Analysis on PCA on Wholesale Customers data

PGP/24/111
Sarbani Mishra

Principal Component Analysis:

PCA is an exploratory data analysis technique used to reduce the number of features. To
decrease complexity, features are grouped together based on the loading (correlation) of
each item with each group.
In our case, PCA was done with the features Milk, Grocery, Detergent, Fresh, Frozen and
Delicatessen.

Deletion criteria:
1. The eigenvalue of a group is less than 1.
2. The communality of an item (sum of its squared loadings across the row) is less than 0.4.
3. The loading of an item on every group is less than 0.4.
4. Cross-loading is present.
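
The criteria above can be sketched in a few lines of Python. This is only an illustration: the loadings and eigenvalues are made-up values, the helper names are hypothetical, and "cross-loading" is assumed here to mean an item whose absolute loading reaches the 0.4 cut-off on more than one group.

```python
# Hypothetical loadings (one row per item, one column per group).
# These are illustrative values, not results from this analysis.
LOADINGS = {
    "item_a": [0.90, 0.05],
    "item_b": [0.85, 0.10],
    "item_c": [0.10, 0.80],
    "item_d": [0.50, 0.55],  # loads on both groups
    "item_e": [0.20, 0.30],  # loads weakly on every group
}
EIGENVALUES = [2.6, 1.7, 0.7]  # hypothetical
THRESHOLD = 0.4                # cut-off used in criteria 2 to 4


def components_to_keep(eigenvalues):
    """Criterion 1: drop groups whose eigenvalue is below 1."""
    return [i for i, ev in enumerate(eigenvalues) if ev >= 1.0]


def communality(row):
    """Sum of squared loadings of one item across all groups."""
    return sum(value * value for value in row)


def deletion_reasons(row, threshold=THRESHOLD):
    """Return the item-level criteria (2 to 4) that an item violates."""
    reasons = []
    if communality(row) < threshold:
        reasons.append("low communality")
    strong = [value for value in row if abs(value) >= threshold]
    if not strong:
        reasons.append("no loading >= threshold")
    if len(strong) > 1:
        reasons.append("cross-loading")  # assumed definition, see lead-in
    return reasons


print("groups kept:", components_to_keep(EIGENVALUES))
for item, row in LOADINGS.items():
    print(item, deletion_reasons(row) or "keep")
```

With these made-up values, the third group is dropped, item_d is flagged for cross-loading and item_e for low communality and weak loadings.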

From the elbow diagram, it is evident that using more than 2 groups would bring in groups
with eigenvalues less than 1; hence nfactors=2 (number of features = 2) is taken.

Using this feature reduction method, 2 groups were made:

RC1 - Milk + Grocery + Detergent
RC2 - Fresh + Frozen + Delicatessen
Total cumulative variance: 72% is explained by RC1 and RC2 together,
where the values of RC1 and RC2 are weighted linear combinations of the features.
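
As a rough illustration of how a cumulative variance figure is obtained: when PCA is run on a correlation matrix, the eigenvalues sum to the number of features, so the share explained by the retained groups is their eigenvalue sum divided by that total. The eigenvalues below are hypothetical, chosen only so that the first two give 72%; they are not the values from this analysis.

```python
def cumulative_variance(eigenvalues, n_keep):
    """Share of total variance explained by the n_keep largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_keep]) / sum(ordered)


# Hypothetical eigenvalues for 6 features (they sum to 6).
eigenvalues = [2.70, 1.62, 0.60, 0.50, 0.38, 0.20]

share = cumulative_variance(eigenvalues, n_keep=2)
print(f"variance explained by 2 groups: {share:.0%}")
```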

Varimax: This rotation is used to get independent groups, i.e. the rotated components RC1
and RC2 are made up of totally different features.
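
For reference, the usual raw varimax criterion can be written as below. The notation is introduced here and is not from the original analysis: \ell_{ij} is the loading of item i on group j, p is the number of items and k the number of groups. The rotation picks the orthogonal rotation of the loadings that maximises V, which pushes each item towards one large loading and near-zero loadings elsewhere. Many implementations also rescale each row by its communality before rotating.

```latex
V = \sum_{j=1}^{k} \left[ \frac{1}{p} \sum_{i=1}^{p} \ell_{ij}^{4}
    - \left( \frac{1}{p} \sum_{i=1}^{p} \ell_{ij}^{2} \right)^{2} \right]
```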

Algorithm :
1. Import data: We used the Wholesale Customers data, which is analysed to create the
basket.
2. Creating test data and train data: We divided the data 80-20 for cross-validation
after the training.
3. Analysis using the package "psych": We used the psych package, which is
designed mainly for psychological analysis and also supports Principal Component
Analysis (PCA). The PCA was used to determine the eigenvalues and communalities, to
understand which products commonly go into the basket together.
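
The 80-20 split in step 2 can be sketched as follows. This is a minimal standard-library illustration, not the code used in the analysis: the function name, the seed and the stand-in rows are assumptions, and the 440-row count assumes the public UCI Wholesale customers dataset was the input.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(rows) * train_fraction))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in rows; the real input would be the customer records.
rows = [{"customer_id": i} for i in range(440)]
train, test = train_test_split(rows)
print(len(train), len(test))  # prints: 352 88
```

Fixing the seed keeps the split reproducible, so the same customers end up in the training set on every run.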

Observation: In the first component, Detergents_Paper, Milk and Grocery are the original
features with the strongest correlations. In the second component, Fresh and Frozen
have the strongest correlations.

In the above figure, nfactors=2 (number of features = 2), so the features are
combined into 2 groups.

Fig 2

In the above figure, nfactors=3 (number of features = 3), so the features are
combined into 3 groups. The third group, RC3, has only one feature (Delicatessen), though
this solution explains more variance than nfactors=2 and the analysis is more wholesome here.

Finally, the scores of RC1 and RC2 are stored in the test and training data.

Hierarchical Analysis on PCA:

With the reduced features, cluster analysis is done using hierarchical clustering. This is
done to analyse the target customers. Basically, clusters 1 and 2 are customer groups formed
based on the basket the customers have chosen.

From the below figure, it was evident that the number of customer clusters should be 2.
From the below diagram, it was evident that most of the customers lie in cluster 1, which
consists of 349 of the 352 customers. Owing to bias, cluster 1 has more members.
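
A minimal sketch of agglomerative (hierarchical) clustering on two component scores is given below. The report does not state the linkage or distance used, so complete linkage with Euclidean distance is an assumption here, and the (RC1, RC2) scores are made up for illustration.

```python
import math


def complete_linkage(a, b, points):
    """Cluster distance = largest Euclidean distance between any pair of members."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerative(points, n_clusters=2):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = complete_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


# Hypothetical (RC1, RC2) scores: five ordinary customers, two with high RC2.
scores = [(-0.5, -0.2), (-0.3, 0.1), (0.0, -0.1), (0.2, 0.3), (0.4, -0.4),
          (0.3, 6.0), (0.8, 7.5)]

clusters = agglomerative(scores, n_clusters=2)
sizes = sorted((len(c) for c in clusters), reverse=True)
print("members per cluster:", sizes)
```

Even in this toy example, a few customers with unusually high RC2 scores end up in a small cluster of their own, while everyone else falls into one large cluster.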

No. of members in the Cluster:


Correlations of RC1 and RC2 with the customer groups

As the number of data points (customer transactions) for group 2 is lower and it shows
high affinity towards RC2, it can be inferred that group 2 mainly preferred the RC2 basket
of Fresh + Frozen + Delicatessen.
