K-Means: Select the K Values


K-means

K-means is an iterative, unsupervised algorithm that tries to group similar items in the form of clusters. It is used to solve many complex unsupervised machine learning problems.

K-means clustering finds the similarity between items and groups them into clusters. The k-means clustering algorithm works in three steps:

1. Select the value of k (the number of clusters).
2. Initialize the centroids.
3. Assign each point to its nearest centroid and recompute each centroid as the average of its group, repeating until the assignments stop changing (see the sketch below).
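
A minimal sketch of these three steps, assuming scikit-learn's KMeans; the synthetic data and the choice of k = 3 are assumptions made up for illustration:

```python
# The three steps with scikit-learn's KMeans on synthetic data; k = 3
# and the random points are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))                       # 100 synthetic 2-D points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # step 1: select k
kmeans.fit(X)    # step 2: initialize centroids; step 3: assign each point to
                 # its nearest centroid and re-average until assignments settle

print(kmeans.cluster_centers_)                 # final centroids
print(kmeans.labels_[:10])                     # clusters of the first 10 points
```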

ADVANTAGES:

1. Easy to implement
2. Scales to large datasets

DISADVANTAGES:

1. The number of clusters k must be chosen manually
2. Results depend on the initial centroid values

Applications

1. Identifying fake news
2. Classifying network traffic
3. Spam filtering
4. Document analysis

Clustering algorithms:

K-Means, Mean Shift, Spectral, Agglomerative, OPTICS

Square error: with, say, 3 clusters, we calculate the distance of each point from its cluster centre and add all of those distances together; that sum is the total square error.

Dividing the total by the number of points gives the mean square error.
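
A small sketch of both quantities, assuming NumPy arrays of points, centroids, and per-point cluster labels (all made up for illustration):

```python
# Total and mean square error, assuming NumPy arrays of points, centroids,
# and the cluster label of each point (all made up for illustration).
import numpy as np

def square_errors(points, centroids, labels):
    diffs = points - centroids[labels]     # offset of each point from its centre
    sq_dist = np.sum(diffs ** 2, axis=1)   # squared distance per point
    return sq_dist.sum(), sq_dist.mean()   # total, then divided by point count

pts = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
cts = np.array([[0.5, 0.5], [4.0, 4.0]])
lbl = np.array([0, 0, 1])
print(square_errors(pts, cts, lbl))        # (1.0, 0.333...)
```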

OLAP (On-Line Analytical Processing)


Operations:

1. Drill down: In the drill-down operation, less detailed data is converted into more detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).

2. Roll up: It is the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the cube given in the overview section, a sub-cube is selected using the following criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”

4. Slice: It selects a single dimension of the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.

5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it. A pandas sketch of all five operations follows this list.
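
As a hedged illustration, the sketch below mimics these OLAP operations on a flat pandas DataFrame; the column names (Location, Time, Item, Sales) and the figures are assumptions made up for this example, not the cube from the overview section:

```python
# A made-up flat table standing in for the OLAP cube.
import pandas as pd

df = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
    "Time":     ["Q1", "Q2", "Q1", "Q2", "Q3"],
    "Item":     ["Car", "Bus", "Car", "Bus", "Truck"],
    "Sales":    [120, 80, 95, 60, 40],
})

# Roll-up: aggregate away a dimension (climb the hierarchy / reduce dimensions)
rollup = df.groupby(["Location", "Item"])["Sales"].sum()

# Drill-down is the reverse: re-introduce a finer level of detail,
# e.g. grouping by Month instead of Quarter (no Month column in this toy data).

# Slice: fix a single dimension value, yielding a sub-cube
slice_q1 = df[df["Time"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions with criteria
dice = df[df["Location"].isin(["Delhi", "Kolkata"])
          & df["Time"].isin(["Q1", "Q2"])
          & df["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view of the data
pivot = df.pivot_table(index="Location", columns="Time",
                       values="Sales", aggfunc="sum")
print(pivot)
```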
Applications:

1. Business reporting for sales, marketing, and management reporting
2. Business process management
3. Budgeting and forecasting
4. Financial reporting

Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds
a class label. The topmost node in the tree is the root node.
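
As a hedged sketch of this structure, the snippet below (scikit-learn and its bundled iris dataset assumed) fits a shallow tree and prints it, showing the attribute test at each internal node and the class label at each leaf:

```python
# Fit a shallow decision tree and print its structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)                 # learning step

print(export_text(tree))       # each internal node is a test on an attribute
print(tree.predict(X[:3]))     # classification step: leaf class labels
```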

Entropy is a measure of disorder or uncertainty, and the goal of machine learning models (and of data scientists in general) is to reduce that uncertainty.

Entropy = degree of randomness.
A 50/50 class split => maximum entropy of 1 bit (impure data with a very high level of disorder).

Information gain is the amount of information gained by knowing the value of an attribute: the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain corresponds to the smallest remaining entropy.

Information gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy:

Info-gain = entropy(parent) − weighted average of the entropy of each branch
A dataset of 50/50 split of samples for the 2 classes would have a maximum entropy of 1 bit,
whereas an imbalanced dataset with a split of 10/90 would have a smaller entropy as there
would be less surprise for a randomly drawn example from the dataset.
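
A small sketch of these calculations; the split counts below are made-up illustration values:

```python
# Entropy and information gain for a binary split.
import math

def entropy(p):
    """Entropy in bits of a two-class distribution with P(class 1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(entropy(0.5))    # 1.0   -> 50/50 split, maximum entropy
print(entropy(0.1))    # ~0.47 -> 10/90 split, less surprise

# Information gain = parent entropy - weighted average of child entropies
parent = entropy(10 / 20)                       # 10 positive / 10 negative
left, right = entropy(8 / 10), entropy(2 / 10)  # two branches of 10 samples
gain = parent - (10 / 20 * left + 10 / 20 * right)
print(gain)            # ~0.28 bits gained by this split
```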

• More entropy left after the splits means the classes are still mixed: underfitting.
• Forcing entropy very low by splitting too deeply fits the noise: overfitting.
The benefits of having a decision tree are as follows:

• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

Apriori algorithm (association rule mining)

The Apriori algorithm is used for mining frequent itemsets and deriving association rules from a transactional database. The parameters “support” and “confidence” are used: support refers to an itemset’s frequency of occurrence, while confidence is a conditional probability (how often the rule holds among transactions that contain its antecedent).
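
As a hedged illustration of these two parameters, the sketch below computes support and confidence over a toy transaction list (the items and transactions are made up):

```python
# Support and confidence over a made-up transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))   # ~0.67 (2 of the 3 bread baskets)
```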

Examples:

1. Recommendation systems such as Netflix.
2. Amazon (although its system is more complicated): predicting what a customer might buy if they buy a certain product.

As the number of clusters increases, accuracy increases and the deviation decreases. The fit() method takes the training data as its arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning (see the sketch below).
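
A minimal sketch of that fit() convention, scikit-learn style; X and y are made-up arrays:

```python
# fit() convention: one array unsupervised, two arrays supervised.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.random.rand(20, 2)                # features
y = (X[:, 0] > 0.5).astype(int)          # labels, used only when supervised

KMeans(n_clusters=2, n_init=10).fit(X)   # unsupervised: fit(X), one array
LogisticRegression().fit(X, y)           # supervised: fit(X, y), two arrays
```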

ADVANTAGES:

1. Easy to implement
2. Uses the large-itemset (Apriori) property: every subset of a frequent itemset is also frequent

DISADVANTAGES:

1. Requires many database scans
2. Very slow
