Unit IV


UNIT IV CLUSTERING

K-Means Clustering Algorithm

It is an iterative unsupervised learning algorithm that divides an unlabeled dataset into K different
clusters in such a way that each data point belongs to only one group of points with similar properties.
The main aim of the algorithm is to minimize the sum of distances between the data points and their
corresponding cluster centroids.

The k-means clustering algorithm mainly performs two tasks:


o Determines the best positions for the K center points (centroids) through an iterative process.
o Assigns each data point to its closest centroid. The data points nearest to a particular
centroid together form a cluster.

Steps in K-Means algorithm:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids (they need not be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the K clusters.
Step-4: Compute a new centroid for each cluster (the mean of the points assigned to it).
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
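These are the steps that libraries such as scikit-learn carry out internally. As a quick illustration only (a sketch assuming scikit-learn is installed, on a small made-up 2-D dataset), the whole procedure can be run as follows:

import numpy as np
from sklearn.cluster import KMeans

# small illustrative 2-D dataset (values chosen only for demonstration)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Step 1: choose K; Steps 2-7 are performed internally by fit()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)   # final centroids of the two clusters
print(kmeans.labels_)            # cluster index assigned to each data point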

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:

o Let's take K = 2, i.e. we will group the data into two clusters.
o Now, we have to choose K random points or centroids to form the clusters. These points
can be points from the dataset or any other points. Here we select two points as the initial
centroids which are not part of our dataset.

o We will assign each data point of the scatter plot to its closest centroid. For this,
we calculate the distance between each point and the two centroids, and draw a median line between
both centroids.

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or
blue centroid, and the points on the right of the line are closer to the yellow centroid.

o As we need to refine the clusters, we repeat the process by choosing new centroids. To choose
the new centroids, we compute the center of gravity (mean) of the points currently assigned to
each cluster, which gives the new centroids shown below:

o Next, we reassign each data point to its new closest centroid. For this, we repeat the same
process of drawing a median line between the centroids, as in the image below:

From the above image, we can see that one yellow point is on the left side of the line and two
blue points are on the right of the line. So, these three points change cluster and are assigned
to the other centroid.

o We repeat the process by again finding the center of gravity of each cluster's points, so the
new centroids will be as shown in the image below:

o As we have new centroids, we again draw the median line and reassign the data points,
giving the image below:

o We can see in the above image that no data points change sides of the line, which means the
assignments have stabilised and our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids and construction lines, and the
two final clusters are as shown in the image below:

Decision of K-Value
The performance of the K-means clustering algorithm depends on how well separated the clusters it
forms are, and this in turn depends on the chosen K. Choosing the optimal number of clusters is therefore
an important task. There are several ways to find the optimal number of clusters.

Elbow Method
This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares,
which measures the total variation within the clusters. The formula for WCSS (for 3
clusters) is given below:
WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

Σ_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point Pi in
Cluster1 and its centroid C1; the other two terms are defined analogously.

To measure the distance between a data point and a centroid, we can use any distance metric, such as
Euclidean distance or Manhattan distance.
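As a small illustration (a sketch only, with made-up numbers), the two distance measures can be computed with NumPy as follows:

import numpy as np

p = np.array([1.0, 2.0])    # a data point (illustrative values)
c = np.array([4.0, 6.0])    # a centroid

euclidean = np.linalg.norm(p - c)     # sqrt((1-4)^2 + (2-6)^2) = 5.0
manhattan = np.sum(np.abs(p - c))     # |1-4| + |2-6| = 7.0
print(euclidean, manhattan)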
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on the given dataset for different K values (typically K ranging
from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The point where the curve bends sharply (so that the plot looks like an arm) is taken as the
best value of K.
Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow
method. A minimal code sketch is given below.
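This is a sketch of the elbow method, assuming scikit-learn and matplotlib are available; scikit-learn exposes the WCSS of a fitted model as the inertia_ attribute. The toy data is made up for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# small illustrative dataset; K is limited by the number of samples here
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

wcss = []
k_values = range(1, 7)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(list(k_values), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()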

HANDLING NON-NUMERICAL DATA (TITANIC DATASET)


Non-numeric data types are data that cannot be manipulated mathematically using standard
arithmetic operators. Non-numeric data comprises text or string data types, the Date data types,
the Boolean data types that store only two values (true or false), the Object data type and the
Variant data type.

import matplotlib.pyplot as plt


from matplotlib import style
style.use('ggplot')
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_excel(r'D:\ML & Python\Book3.xlsx')   # raw string so the backslashes are not treated as escapes
print(df.head())

pclass survived ... body home.dest


0 1 1 ... NaN St Louis, MO
1 1 1 ... NaN Montreal, PQ / Chesterville, ON
2 1 0 ... NaN Montreal, PQ / Chesterville, ON
3 1 0 ... 135.0 Montreal, PQ / Chesterville, ON
4 1 0 ... NaN Montreal, PQ / Chesterville, ON

[5 rows x 14 columns]

There are several ways to handle non-numerical data. Here, we cycle through the columns of the
Pandas dataframe. For columns that are not numeric, we find their unique elements, which can be done
by simply taking a set of the column values. The index of each element within that set then becomes
the new "numerical" value, or "id", of the text data.

df.drop(['body', 'name'], axis=1, inplace=True)
df.fillna(0, inplace=True)
print(df.head())

pclass survived gender ... embarked boat home.dest


0 1 1 female ... S 2 St Louis, MO
1 1 1 male ... S 0 Montreal, PQ / Chesterville, ON
2 1 0 female ... S 0 Montreal, PQ / Chesterville, ON
3 1 0 male ... S 0 Montreal, PQ / Chesterville, ON
4 1 0 female ... S 0 Montreal, PQ / Chesterville, ON

[5 rows x 12 columns]

def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            # collect the unique text values appearing in this column
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            # map each unique value to an integer id
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            # replace the text values by their integer ids
            df[column] = list(map(convert_to_int, df[column]))

    return df

df = handle_non_numerical_data(df)

print(df.head())

pclass survived sex age sibsp parch ticket fare cabin \


0 1 1 1 29.0000 0 0 767 211.3375 80
1 1 1 0 0.9167 1 2 531 151.5500 149
2 1 0 1 2.0000 1 2 531 151.5500 149
3 1 0 0 30.0000 1 2 531 151.5500 149
4 1 0 1 25.0000 1 2 531 151.5500 149

embarked boat home.dest


0 1 1 307
1 1 27 43
2 1 0 43
3 1 0 43
4 1 0 43
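An alternative to the hand-written mapping above (not the method used in this unit, just an equivalent sketch using pandas' built-in factorize) would be:

import pandas as pd

def encode_non_numeric(df):
    df = df.copy()
    for column in df.columns:
        if df[column].dtype == object:                  # only text columns
            df[column], _ = pd.factorize(df[column])    # unique strings -> 0, 1, 2, ...
    return df

As with the function above, each unique string in a column is replaced by a small integer id.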

K-MEANS WITH TITANIC DATASET


Using the K-Means algorithm, we want the two clusters it finds to correspond, as far as possible, to
survivors and non-survivors.

X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = KMeans(n_clusters=2)
clf.fit(X)

correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = clf.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))
0.7081741787624141
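Note that KMeans labels its two clusters 0 and 1 arbitrarily, so on another run the same clustering can score as roughly one minus the value above. A common way to report the agreement independently of the labelling is:

# cluster ids are arbitrary, so report agreement regardless of which cluster
# was labelled 0 and which was labelled 1
accuracy = correct / len(X)
print(max(accuracy, 1 - accuracy))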

K-MEANS IN PYTHON

X = np.array([[1, 2],
[1.5, 1.8],
[5, 8 ],
[8, 8],
[1, 0.6],
[9,11]])

plt.scatter(X[:,0], X[:,1], s=150)


plt.show()

We take K=2. We build our K Means class:

class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

The tol value is our tolerance: we consider the algorithm optimized (converged) once the centroids
move by less than this amount between iterations. The max_iter value limits the number of iterations
we will run.

Now we'll build the fit method:

def fit(self, data):

    self.centroids = {}

    for i in range(self.k):
        self.centroids[i] = data[i]

To begin, we create an empty dictionary that will hold the centroids. The first for loop then assigns
the starting centroids to be the first two data samples in our data.

    for i in range(self.max_iter):
        self.classifications = {}

        for i in range(self.k):
            self.classifications[i] = []

Full Python Code-

import matplotlib.pyplot as plt


from matplotlib import style
style.use('ggplot')
import numpy as np

X = np.array([[1, 2],
[1.5, 1.8],
[5, 8 ],
[8, 8],
[1, 0.6],

[9,11]])

plt.scatter(X[:,0], X[:,1], s=150)


plt.show()

colors = 10*["g","r","c","b","k"]

class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, data):

        self.centroids = {}

        for i in range(self.k):
            self.centroids[i] = data[i]

        for i in range(self.max_iter):
            self.classifications = {}

            for i in range(self.k):
                self.classifications[i] = []

Next, we need to iterate through our features, calculate distances of the features to the current
centroids, and classify them as such:

            # assign every sample to its nearest centroid
            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centroids[centroid])
                             for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classifications[classification].append(featureset)

            prev_centroids = dict(self.centroids)

            # move each centroid to the mean of the samples assigned to it
            for classification in self.classifications:
                self.centroids[classification] = np.average(self.classifications[classification], axis=0)

            # stop once no centroid moves by more than the tolerance (percent change)
            optimized = True

            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                movement = np.sum(np.abs((current_centroid - original_centroid) / original_centroid * 100.0))
                if movement > self.tol:
                    print(movement)
                    optimized = False

            if optimized:
                break

    def predict(self, data):
        distances = [np.linalg.norm(data - self.centroids[centroid])
                     for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification

Now, we call this K_Means class and plot the resulting clusters and centroids:


clf = K_Means()
clf.fit(X)

# plot the final centroids
for centroid in clf.centroids:
    plt.scatter(clf.centroids[centroid][0], clf.centroids[centroid][1],
                marker="o", color="k", s=150, linewidths=5)

# plot each sample in the colour of the cluster it was assigned to
for classification in clf.classifications:
    color = colors[classification]
    for featureset in clf.classifications[classification]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()

HIERARCHICAL CLUSTERING ALGORITHM (HCA)


• Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group an unlabeled dataset into clusters.
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
• Two approaches
o Bottom Up - The algorithm starts by treating every data point as a single cluster and
keeps merging the two closest clusters until only one cluster, containing all the data
points, is left. This approach is called Agglomerative clustering.
o Top Down - The Divisive algorithm is the reverse of the agglomerative algorithm: it starts
with all points in one cluster and recursively splits it.

Steps of the Hierarchical Agglomerative Algorithm


1. Place each data point in its own singleton cluster.
2. Iteratively merge the two closest clusters.
3. Repeat step 2 until all the data points are merged into a single cluster.
4. The final step yields a dendrogram. We cut the dendrogram at a certain level to obtain
the final set of clusters.

The way the distance between two clusters is measured is crucial for hierarchical clustering. There are
various ways to calculate the distance between two clusters, and these choices define the rule for
merging. These measures are called linkage methods; a code sketch using them follows the list.
1. Single Linkage: the distance between two clusters is the shortest distance between their closest
points.

2. Complete Linkage: the distance between two clusters is the farthest distance between two points
in the two different clusters. It is a popular linkage method as it forms tighter clusters than
single linkage.

3. Average Linkage: the distance between every pair of points (one from each cluster) is added up
and divided by the number of pairs to obtain the average distance between the two clusters. It is
also one of the most popular linkage methods.
4. Centroid Linkage: the distance between the centroids of the two clusters is used.
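A brief sketch of agglomerative clustering, assuming SciPy and scikit-learn are available: scipy.cluster.hierarchy builds and plots the dendrogram, and AgglomerativeClustering cuts the hierarchy into a chosen number of flat clusters. The toy data below is made up for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# build the hierarchy bottom-up; 'single', 'complete', 'average' and
# 'centroid' correspond to the linkage methods described above
Z = linkage(X, method='single')
dendrogram(Z)
plt.show()

# cut the hierarchy to obtain 2 flat clusters
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
print(agg.fit_predict(X))   # cluster label of each data point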

MEAN SHIFT CLUSTERING
Mean Shift is a powerful clustering algorithm used in unsupervised learning. Unlike K-means
clustering, it does not assume a fixed number of clusters or any particular cluster shape; hence it is a
non-parametric algorithm. The mean-shift algorithm assigns the data points to clusters iteratively by
shifting points towards the regions of highest density of data points, i.e. the cluster centroids.

The difference between the K-Means algorithm and Mean Shift is that the latter does not need the
number of clusters to be specified in advance, because the number of clusters is determined by the
algorithm from the data.

It is based on Kernel Density Estimation (KDE), which is a way to estimate the probability density
function of a random variable: inferences about the population are made by smoothing the data. It
works by placing a weight function, called a kernel, on each data point. There are many kinds of
kernels - the Gaussian kernel, rectangular kernel, flat kernel, etc. Adding all those kernels together
creates a density function (a probability surface).

Plotting this KDE surface over the region where the data points lie gives a surface whose hills
correspond to concentrations of points.

We can make each point climb uphill to the nearest peak on the KDE surface, iteratively shifting
each point towards that peak. The bandwidth parameter used to build the KDE surface controls
the size of the kernel: a tall, skinny kernel corresponds to a small bandwidth, while a short, fat
kernel corresponds to a large bandwidth. A small bandwidth makes the KDE surface keep a peak for
almost every data point, so each point tends to form its own cluster; a large bandwidth merges
peaks and results in fewer clusters.

Steps of Mean-Shift Algorithm

Step 1 – Define a window (the bandwidth of the kernel to be used for estimation) and place the
window on a data point.
Step 2 – Find the mean of all the points within the window.
Step 3 – Move the window to the location of the mean.
Step 4 – Repeat steps 2-3 until convergence.

On convergence, all data points within the same window form a cluster. A short code sketch follows.
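The sketch below uses scikit-learn's MeanShift, with estimate_bandwidth standing in for the manual choice of window size; the data is the same made-up toy set used earlier.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# estimate the kernel bandwidth (the window size in the steps above)
bandwidth = estimate_bandwidth(X, quantile=0.5)

ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)

print(ms.cluster_centers_)   # one centre per discovered cluster
print(ms.labels_)            # the number of clusters is found automatically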

NAÏVE BAYES CLASSIFIER
o The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification with high-dimensional training datasets.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o The Naïve Bayes algorithm is made up of the two words Naïve and Bayes, which can be
described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

where,
• P(A|B) is the Posterior probability: the probability of event A given that event B is true
(has occurred). Event B is also termed the evidence.
• P(B|A) is the Likelihood: the probability of B given that event A is true, i.e. the probability
of the evidence B after the hypothesis A is assumed.
• P(A) is the Prior probability: the probability of event A before the evidence is seen.
• P(B) is the Marginal probability: the probability of the evidence.

To understand the working of the Naïve Bayes classifier, consider the example below.

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to
the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:


Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:


Weather      No             Yes            P(Weather)

Overcast     0              5              5/14 = 0.35

Rainy        2              2              4/14 = 0.29

Sunny        2              3              5/14 = 0.35

All          4/14 = 0.29    10/14 = 0.71


Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
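The same calculation can be reproduced in a few lines of Python (a sketch only, with the counts taken from the frequency table above):

# counts from the frequency table above
p_sunny_given_yes = 3 / 10     # Sunny days among the 10 "Yes" days
p_sunny_given_no = 2 / 4       # Sunny days among the 4 "No" days
p_yes = 10 / 14                # prior P(Yes)  ~ 0.71
p_no = 4 / 14                  # prior P(No)   ~ 0.29
p_sunny = 5 / 14               # marginal P(Sunny) ~ 0.35

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # ~ 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # ~ 0.41

print('Play' if p_yes_given_sunny > p_no_given_sunny else "Don't play")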

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

NAÏVE BAYES CLASSIFIER WITH SCIKIT

# Gaussian Naive Bayes


from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
# load the iris datasets
dataset = datasets.load_iris()
# fit a Naive Bayes model to the data
model = GaussianNB()

model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

OUTPUT:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.94      0.94      0.94        50
           2       0.94      0.94      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150

[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]

Text Classification using Naive Bayes

Text Analysis is a major application field for machine learning algorithms. However, the raw data, a
sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them
expect numerical feature vectors of a fixed size rather than raw text documents of variable
length.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.

The dataset is divided into two parts, namely, feature matrix and the response/target vector.
• The Feature matrix (X) contains all the vectors (rows) of the dataset, in which each vector consists
of the values of the dependent features. The number of features is d, i.e. X = (x1, x2, …, xd).
• The Response/target vector (y) contains the value of the class/group variable for each row of the
feature matrix.

The two main assumptions of Naive Bayes


Naive Bayes assumes that each feature/variable of the same class makes an:
• independent
• equal contribution to the outcome.

The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the
independence assumption is often not met, and this is why it is called "Naive", i.e. because it assumes
something that might not be true.

The Naive Bayes Model


Given a data matrix X and a target vector y, we state our problem as:

P(y | X) = P(X | y) * P(y) / P(X)

where y is the class variable and X is a dependent feature vector with dimension d, i.e. X = (x1, x2, …,
xd), where d is the number of variables/features of the sample.
• P(y|X) is the probability of observing the class y given the sample X, with X = (x1, x2, …, xd).

Now the "naïve" conditional independence assumption comes into play: assume that all features
in X are mutually independent, conditional on the category y:

P(y | x1, …, xd) = P(y) * P(x1 | y) * P(x2 | y) * … * P(xd | y) / P(x1, …, xd)

Finally, to classify a given sample, we evaluate this for all possible values of the class variable y and
output the class with maximum probability:

y = argmax over y of P(y) * Π(i = 1..d) P(xi | y)

Dealing with text data
The raw data is a collection of strings which, as explained above, cannot be fed directly to the
algorithms, since they expect fixed-size numerical feature vectors rather than variable-length text.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical
features from text content, namely:
• tokenizing strings and giving an integer id for each possible token, for instance by using white-
spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
“Counting” Example (to really understand this before we move on):

from sklearn.feature_extraction.text import CountVectorizer


corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]

Implementation in Python
Here we consider a multi-class (20 classes) text classification problem.
First, we will load all the necessary libraries:
import numpy as np, pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

Next, let’s load the data (training and test sets):


# Load the dataset
data = fetch_20newsgroups()

# Get the text categories
text_categories = data.target_names
# define the training set
train_data = fetch_20newsgroups(subset="train", categories=text_categories)
# define the test set
test_data = fetch_20newsgroups(subset="test", categories=text_categories)

Let’s find out how many classes and samples we have:


print("We have {} unique classes".format(len(text_categories)))
print("We have {} training samples".format(len(train_data.data)))
print("We have {} test samples".format(len(test_data.data)))

Output:
We have 20 unique classes
We have 11314 training samples
We have 7532 test samples

So, this is a 20-class text classification problem with n_train = 11314 training samples (text
sentences) and n_test = 7532 test samples (text sentences).

Let's visualize the 5th sample of the test set:


# let's have a look at some test data
print(test_data.data[5])
As mentioned previously, our data are texts (more specifically, emails), so this prints the raw text of
one email.

The next step consists of building the Naive Bayes classifier and finally training the model. In our
example, we will convert the collection of text documents (train and test sets) into a matrix of TF-IDF
features.

To chain that text transformation with the classifier, we will use the make_pipeline function. The
pipeline internally transforms the text data, and then the model is fitted using the transformed data.

# Build the model


model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# Train the model using the training data
model.fit(train_data.data, train_data.target)
# Predict the categories of the test data
predicted_categories = model.predict(test_data.data)

The last line of code predicts the labels of the test set.

Let’s see the predicted categories names:


print(np.array(test_data.target_names)[predicted_categories])
array(['rec.autos', 'sci.crypt', 'alt.atheism', ..., 'rec.sport.baseball', 'comp.sys.ibm.pc.hardware',
'soc.religion.christian'], dtype='<U24')

Finally, let’s build the multi-class confusion matrix to see if the model is good or if the model predicts
correctly only specific text categories.
# compute the confusion matrix
mat = confusion_matrix(test_data.target, predicted_categories)
print("The accuracy is {}".format(accuracy_score(test_data.target, predicted_categories)))
The accuracy is 0.7738980350504514
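The code above only computes the matrix and prints the accuracy; one possible way to visualise the matrix (a sketch, assuming seaborn is installed) is:

import seaborn as sns
import matplotlib.pyplot as plt

# 'mat' and 'train_data' come from the code above
sns.heatmap(mat.T, annot=True, fmt='d', cbar=False,
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()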

From the confusion matrix plotted above, we can verify that the model is quite good.
• It correctly predicts most documents in all 20 classes of the text data (most values lie on the
diagonal and few are off the diagonal).
• We also notice that the largest misclassification (off-diagonal value) is 131. This means that 131
documents belonging to the "religion miscellaneous" category were misclassified as belonging to
the "religion christian" category.

