STIN2044 KNOWLEDGE DISCOVERY IN DATABASES

(GROUP A)

SECOND SEMESTER SESSION 2020/2021 (A202)

GROUP ASSIGNMENT:

ASSIGNMENT 3

SUBMITTED TO:

PN NORAZIAH BINTI CHE PA

NO.  NAME               MATRIC NO.
1    TOH WU WAYNN       261761
2    YONG WEI JING      261482
3    EMILY SIEW KE HUI  261398
4    YANG YUNFAN        257231

1.0 CLASSIFICATION
1.1 SOURCE CODE
import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Load data
dataset = pandas.read_csv('dataset.csv')

# Data preprocessing: drop the identifier column and any rows with missing values
dataset.drop('id', inplace=True, axis=1)
dataset.dropna(axis=0, inplace=True)

# Split features from target
X = dataset[['clump_thickness', 'size_uniformity', 'shape_uniformity', 'marginal_adhesion',
             'epithelial_size', 'bare_nucleoli', 'bland_chromatin', 'normal_nucleoli', 'mitoses']]
Y = dataset['class']

# Rescale features to zero mean and unit variance
sc = StandardScaler()
X = sc.fit_transform(X)

# Split dataset into training (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

# Create a new model and train it with the training dataset
clf = DecisionTreeClassifier(criterion='entropy')
model = clf.fit(X_train, Y_train)

# Use the trained model to predict the test dataset
predictions = model.predict(X_test)

# Calculate model accuracy
misclassified = (Y_test != predictions).sum()
accuracy = metrics.accuracy_score(Y_test, predictions)

print('\n--- RESULTS ---')
print('Misclassification count: ' + str(misclassified))
print('Classification Accuracy: {:.3f}%\n'.format(accuracy * 100))

1.2 SOURCE CODE OUTPUT

Figure 1: Source Code Output

2.0 CLUSTERING
For clustering, we use the Mall Customer Segmentation Data from Kaggle, which consists of 200
instances. Customer segmentation is a popular application of unsupervised learning: clustering
identifies segments of customers so that a business can target its potential user base. Customers are
divided into groups according to common characteristics such as gender, age, interests, and spending
habits so that each group can be marketed to effectively. For this question we perform K-means
clustering using the Orange data mining tool, visualize the gender and age distributions, and then
analyze the customers' annual incomes and spending scores.

Figure 2: Widgets used in Orange
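
To complement the Orange workflow, the same pipeline can be sketched in a few lines of scikit-learn. This is a minimal sketch, not the workflow we actually ran; the file name 'Mall_Customers.csv' and the column names 'Annual Income (k$)' and 'Spending Score (1-100)' are assumptions based on the Kaggle dataset and should be adjusted to match the actual copy of the data.

import pandas
from sklearn.cluster import KMeans

# Assumed file and column names from the Kaggle Mall Customers dataset
customers = pandas.read_csv('Mall_Customers.csv')
features = customers[['Annual Income (k$)', 'Spending Score (1-100)']]

# Cluster into 6 groups, mirroring the k-Means settings used in Orange
kmeans = KMeans(n_clusters=6, init='k-means++', n_init=10, random_state=5)
customers['cluster'] = kmeans.fit_predict(features)

# Average income and spending score per cluster, to characterize each segment
print(customers.groupby('cluster')[list(features.columns)].mean())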


a. Discuss the effect of the number of clusters – how different values of k affect clustering

Figure 3: Clustering Output when k = 2

When k is 2, we can observe that the clustering works and effectively separates the customers into 2
groups: those with a low spending score (C1) and those with a high spending score (C2). This
clustering already presents valuable information, but it does not reach its potential, as the instances
can still be clustered to reveal much more detailed information.
Figure 4: Clustering Output when k = 6

Figure 4 shows the scatter plot of the data, colored by the cluster each point belongs to, when k is
equal to 6. From the figure, it is obvious that we have 6 clusters. The yellow points (C5) are
customers with a low annual income but a high spending score. The purple points (C6) are customers
with a low income and a low spending score. Next, customers with a high annual income and a high
spending score are plotted in green (C3), while the orange points (C4) are customers with a high
annual income but a low spending score. Finally, the blue (C1) and red (C2) points are customers
with an average annual income and an average spending score.
Figure 5: Clustering Output when k = 8

When k = 8, we observe an over-analyzed clustering output. The information we obtained from the
clustering output when k is 6 (Figure 4) is present in this output as well. Compared with the output in
Figure 4, we can see that clusters from the k = 6 run are now split into 2 clusters, such as C4 and C8.
Another example is the middle of the plot, where C2, C5, and C7 are clustered together, which
contributes no additional information beyond what the k = 6 output already provides.

To conclude, the performance of clustering is affected by the value of k. As the plots above show, the
clustering is not effective when k is either too low or too high. When k is 2, only coarse information
can be obtained from the clustering, and when k is 8, the clustering is over-analyzed and
over-complicated with no significant additional information. Therefore, 6 is the sweet spot for the
number of clusters for this particular dataset, where we obtain the most effective clustering.
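
One common way to check this choice of k programmatically is the elbow method: plot the within-cluster sum of squares (inertia) against k and look for the bend in the curve. The sketch below uses scikit-learn rather than Orange (an assumption, since our analysis was done in Orange) and the same assumed file and column names as the earlier sketch.

import pandas
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

customers = pandas.read_csv('Mall_Customers.csv')  # assumed file name
features = customers[['Annual Income (k$)', 'Spending Score (1-100)']]

# Fit K-means for a range of k and record the inertia of each fit
k_values = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=5).fit(features).inertia_
            for k in k_values]

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow method')
plt.show()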
b. Discuss the effect of initialization - how different initializations affect clustering

Figure 6: K-Means++ Initialization


Figure 7: Random Initialization
Based on Figures 6 and 7, the clusters in Figure 6 were initialized with K-Means++ initialization,
while the clusters in Figure 7 were initialized with random initialization. What we observe in both
figures is an identical set of clusters, even though they were initialized differently. The shapes and
colours are assigned at random and do not affect the clustering. Thus, we can conclude that the
initialization has little or no effect on the outcome of the clustering for this particular dataset.
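
This observation can also be checked numerically. The sketch below, again in scikit-learn rather than Orange (an assumption), fits the data once with k-means++ initialization and once with random initialization, then compares the two partitions with the adjusted Rand index, which equals 1.0 when two clusterings are identical.

import pandas
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

customers = pandas.read_csv('Mall_Customers.csv')  # assumed file name
features = customers[['Annual Income (k$)', 'Spending Score (1-100)']]

# Same k, two different initialization schemes
labels_pp = KMeans(n_clusters=6, init='k-means++', n_init=10,
                   random_state=5).fit_predict(features)
labels_rand = KMeans(n_clusters=6, init='random', n_init=10,
                     random_state=5).fit_predict(features)

# 1.0 means the two runs produced identical partitions
print('Agreement (adjusted Rand index):',
      adjusted_rand_score(labels_pp, labels_rand))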
c. Provide a graphical representation of the clustering outcome
Figure 8: Clustering Outcome

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the
data. This can be used to confirm business assumptions about what types of groups exist or to identify
unknown groups in complex data sets. The graph above shows 6 clusters across different annual
incomes and spending scores. The majority of the customers (C1 & C2) have an annual income of
40-70 and a spending score of 40-60. Besides that, C3 is a cluster with annual incomes of 80-120 and
the highest spending scores, 80-100. The C4 cluster consists of customers with a high annual income
but a low spending score. The C5 cluster consists of customers with a low annual income but a high
spending score, and the data shows that the majority of them are female. Last but not least, C6 is the
smallest cluster, with both a low annual income and a low spending score.
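
For reference, a plot like Figure 8 could be reproduced outside Orange with matplotlib. The sketch below is only an assumption of how such a figure could be built, using the same assumed file and column names as the earlier sketches; the cluster colours will not necessarily match those in Figure 8.

import pandas
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

customers = pandas.read_csv('Mall_Customers.csv')  # assumed file name
features = customers[['Annual Income (k$)', 'Spending Score (1-100)']]

labels = KMeans(n_clusters=6, n_init=10, random_state=5).fit_predict(features)

# Colour each customer by its cluster label
plt.scatter(features.iloc[:, 0], features.iloc[:, 1], c=labels, cmap='tab10')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('K-means clustering outcome (k = 6)')
plt.show()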
