Professional Documents
Culture Documents
Unit IV
Unit IV
Unit IV
It is an iterative unsupervised learning algorithm that divides the unlabeled dataset into k different
clusters in such a way that each dataset belongs only one group that has similar properties. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
1
o Let's take K=2, to group datasets into two different clusters.
o Now, we have to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting the
below two points as k points, which are not the part of our dataset.
o We will assign each data point of the scatter plot to its closest K-point or centroid. For this,
we have to calculate the distance between two points. So, we will draw a median between
both the centroids.
From the above image, it is clear that points left side of the line is near to the K1 or
blue centroid, and points to the right of the line are close to the yellow centroid.
o As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
2
centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two
blue points are right to the line. So, these three points will be assigned to new centroids.
o We will repeat the process by finding the center of gravity of centroids, so the new centroids
will be as shown in the below image:
3
o As we got the new centroids so again will draw the median line and reassign the data points.
So, the image will be:
o We can see in the above image; there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
4
Decision of K- Value
The performance of the K-means clustering algorithm depends upon highly efficient clusters that it
forms. But choosing the optimal number of clusters is a big task. There are some different ways to
find the optimal number of clusters.
Elbow Method
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares,
which defines the total variations within a cluster. The formula to calculate the value of WCSS (for 3
clusters) is given below:
WCSS= ∑Pi in Cluster1 distance(Pi C1)2 +∑Pi in Cluster2distance(Pi C2)2+∑Pi in CLuster3 distance(Pi C3)2
∑Pi in Cluster1 distance(Pi C1)2: It is the sum of the square of the distances between each data point and
its centroid within a cluster1 and the same for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-
10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend or a point of the plot looks like an arm, then that point is considered
as the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow
method.
5
df = pd.read_excel('D:\ML & Python\Book3.xlsx')
print(df.head())
[5 rows x 14 columns]
There are many ways to handle for non-numerical data. First, we will want to cycle through the
columns in the Pandas dataframe. For columns that are not numbers, we want to find their unique
elements. This can be done by simply take a set of the column values. From here, the index within
that set can be the new "numerical" value or "id" of the text data.
df.drop(['body','name'], 1, inplace=True)
df.fillna(0, inplace=True)
print(df.head())
[5 rows x 12 columns]
def handle_non_numerical_data(df):
columns = df.columns.values
for column in columns:
text_digit_vals = {}
def convert_to_int(val):
return text_digit_vals[val]
return df
df = handle_non_numerical_data(df)
6
print(df.head())
X = np.array(df.drop(['survived'], 1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])
clf = KMeans(n_clusters=2)
clf.fit(X)
correct = 0
for i in range(len(X)):
predict_me = np.array(X[i].astype(float))
predict_me = predict_me.reshape(-1, len(predict_me))
prediction = clf.predict(predict_me)
if prediction[0] == y[i]:
correct += 1
print(correct/len(X))
0.7081741787624141
K-MEANS IN PYTHON
X = np.array([[1, 2],
[1.5, 1.8],
[5, 8 ],
[8, 8],
[1, 0.6],
[9,11]])
7
We take K=2. We build our K Means class:
class K_Means:
def __init__(self, k=2, tol=0.001, max_iter=300):
self.k = k
self.tol = tol
self.max_iter = max_iter
The tol value is our tolerance, which will optimize if the centroid is not moving more than the
tolerance value. The max_iter value is to limit the number of cycles we will run.
def fit(self,data):
self.centroids = {}
for i in range(self.k):
self.centroids[i] = data[i]
To begin, use an empty dictionary, which will contain centroids at end of program. Next, we begin a
for loop which assigns our starting centroids as the first two data samples in our data.
for i in range(self.max_iter):
self.classifications = {}
for i in range(self.k):
self.classifications[i] = []
X = np.array([[1, 2],
[1.5, 1.8],
[5, 8 ],
[8, 8],
[1, 0.6],
8
[9,11]])
colors = 10*["g","r","c","b","k"]
class K_Means:
def __init__(self, k=2, tol=0.001, max_iter=300):
self.k = k
self.tol = tol
self.max_iter = max_iter
def fit(self,data):
self.centroids = {}
for i in range(self.k):
self.centroids[i] = data[i]
for i in range(self.max_iter):
self.classifications = {}
for i in range(self.k):
self.classifications[i] = []
Next, we need to iterate through our features, calculate distances of the features to the current
centroids, and classify them as such:
prev_centroids = dict(self.centroids)
optimized = True
for c in self.centroids:
original_centroid = prev_centroids[c]
current_centroid = self.centroids[c]
if np.sum((current_centroid-original_centroid)/original_centroid*100.0) > self.tol:
print(np.sum((current_centroid-original_centroid)/original_centroid*100.0))
optimized = False
if optimized:
break
9
def predict(self,data):
distances = [np.linalg.norm(data-self.centroids[centroid]) for centroid in self.centroids]
classification = distances.index(min(distances))
return classification
plt.show()
10
3. Repeat step 2 until all the data points are merged into single cluster.
4. We obtain a dendogram at the final step. We cut the dendogram at a certain level to obtain
the final set of clusters.
The closest distance between the two clusters is crucial for the hierarchical clustering. There are
various ways to calculate the distance between two clusters, and these ways decide the rule for
clustering. These measures are called Linkage methods.
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters.
2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of
datasets is added up and then divided by the total number of datasets to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid of the
clusters is calculated.
11
MEAN SHIFT CLUSTERING
Mean Shift is a powerful clustering algorithm used in unsupervised learning. Unlike K-means
clustering, it does not make any assumptions; hence it is a non-parametric algorithm.
Mean-shift algorithm basically assigns the datapoints to the clusters iteratively by shifting points
towards the highest density of datapoints i.e. cluster centroid.
The difference between K-Means algorithm and Mean-Shift is that later one does not need to
specify the number of clusters in advance because the number of clusters will be determined by the
algorithm w.r.t data.
It is based on Kernel Density Estimation(KDE), which is a way to estimate the probability density
function of a random variable. KDE is a problem where the inferences of the population are made by
data smoothing. It works by providing weights to each data point. The weight function is called a
kernel. There are many kinds of kernels- Gaussian kernel, rectangular kernel, flat kernel, etc. Adding
all those kernels together creates a density function (probability surface).
The KDE surface where our data points are distributed in the surface plot. The hills can be
considered as the kernel.
We can make the points climb uphill to the nearest peak on the KDE surface. So, iteratively shifting
each point to climb uphill to the peak. The bandwidth parameter used to make the KDE surface
varies on the different sizes. For example, we have a tall skinny kernel which means a small kernel
bandwidth and in a case where the size of the kernel is short and fat, which means a large kernel
bandwidth. A small kernel bandwidth makes the KDE surface hold the peak for every data point
more formally, saying each point has its cluster; on the other hand, large kernel bandwidth results in
fewer kernels or fewer clusters.
Step 1 – Define a window ( bandwidth of the Kernel to be used for estimation) and place the
window on a data point.
Step 2 – Find mean of all the points within the window.
Step 3 – Move the window to the location of the mean.
Step 4 – Repeat step 2-3 until convergence.
12
NAÏVE BAYES CLASSIFIER
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the
bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identify that it is an apple without
depending on each other.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
where,
• P(A|B) is Posterior probability. P(A | B) is the probability of event A, given the event B is true
(has occured). Event B is also termed as evidence.
• P(B|A) is Likelihood probability. P(B | A) is the probability of B given event A, i.e. probability
of event B after evidence A is seen.
• P(A) is Prior Probability. P(A) is the priori of A (the prior independent probability, i.e.
probability of event before evidence is seen).
• P(B) is Marginal Probability
Suppose we have a dataset of weather conditions and corresponding target variable "Play". So using
this dataset we need to decide that whether we should play or not on a particular day according to
the weather conditions. So to solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
13
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
14
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|NO)= 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)
Hence on a Sunny day, Player can play the game.
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
O U T P UT :
p r e c i s io n r e c a l l f1 - s co r e s u p po r t
15
a cc u r a c y 0.96 150
m a c ro av g 0.96 0.96 0.96 150
w e i g h t e d av g 0 .9 6 0.96 0.96 150
[[50 0 0]
[ 0 47 3]
[ 0 3 47]]
Text Analysis is a major application field for machine learning algorithms. However the raw data, a
sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of them
expect numerical feature vectors with a fixed size rather than the raw text documents with variable
length.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
The dataset is divided into two parts, namely, feature matrix and the response/target vector.
• The Feature matrix (X) contains all the vectors(rows) of the dataset in which each vector consists
of the value of dependent features. The number of features is d i.e. X = (x1,x2,x2, xd).
• The Response/target vector (y) contains the value of class/group variable for each row of
feature matrix.
where, y is class variable and X is a dependent feature vector with dimension d i.e. X = (x1,x2,x2,
xd), where d is the number of variables/features of the sample.
• P(y|X) is the probability of observing the class y given the sample X with X = (x1,x2,x2,
xd), where d is the number of variables/features of the sample.
Now the “naïve” conditional independence assumptions come into play: assume that all features
in X are mutually independent, conditional on the category y:
Finally, to find the probability of a given sample for all possible values of the class variable y, we just
need to find the output with maximum probability:
16
Dealing with text data
The raw data is a collection of strings.
Text Analysis is a major application field for machine learning algorithms. However the raw data, a
sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of
them expect numerical feature vectors with a fixed size rather than the raw text documents with
variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical
features from text content, namely:
• tokenizing strings and giving an integer id for each possible token, for instance by using white-
spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
“Counting” Example (to really understand this before we move on):
print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Implementation in Python
Here we consider a multi-class (20 classes) text classification problem.
First, we will load all the necessary libraries:
import numpy as np, pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
17
# Get the text categories
text_categories = data.target_names
# define the training set
train_data = fetch_20newsgroups(subset="train", categories=text_categories)
# define the test set
test_data = fetch_20newsgroups(subset="test", categories=text_categories)
Output:
We have 20 unique classes
We have 11314 training samples
We have 7532 test samples
So, this is a 20-class text classification problem with n_train = 11314 training samples (text
sentences) and n_test = 7532 test samples (text sentences).
18
The next step consists of building the Naive Bayes classifier and finally training the model. In our
example, we will convert the collection of text documents (train and test sets) into a matrix of token
counts.
To implement that text transformation we will use the make_pipeline function. This will internally
transform the text data and then the model will be fitted using the transformed data.
The last line of code predicts the labels of the test set.
Finally, let’s build the multi-class confusion matrix to see if the model is good or if the model predicts
correctly only specific text categories.
# plot the confusion matrix
mat = confusion_matrix(test_data.target, predicted_categories)
print("The accuracy is {}".format(accuracy_score(test_data.target, predicted_categories)))
The accuracy is 0.7738980350504514
19
From the above confusion matrix, we can verify that the model is really good.
• It is able to correctly predict all 20 classes of the text data (most values are on the diagonal and
few are off-the-diagonal).
• We also notice that the highest miss-classification (value off-the-diagonal) is 131. The value 131
means that 131 documents that belonged to the “religion miscellaneous ” category were miss-
classified as belonging to the “religion christian” category.
20