K-Means: Selecting the K Value
K-means is an iterative, unsupervised algorithm that tries to group similar items into
clusters. It is used to solve many complex unsupervised machine learning problems.
It finds the similarity between items (typically by distance) and groups them into
clusters. The algorithm works in three steps: (1) pick k initial cluster centres
(centroids); (2) assign each item to its nearest centroid; (3) recompute each centroid
as the mean of its assigned items, repeating steps 2 and 3 until the assignments stop
changing.
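A minimal NumPy sketch of those three steps (the function name kmeans, the random
initialization, and the convergence test are our own simplifications, not from the notes):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (this sketch assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids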
ADVANTAGES:
1. Easy to implement and computationally efficient on large datasets.
DISADVANTAGES:
1. The number of clusters k must be chosen in advance.
2. Sensitive to the initial choice of centroids and to outliers.
Applications: customer segmentation, document clustering, and image compression.
Square Error: with, say, 3 clusters, we calculate the squared distance from each point to
the centre of its cluster and add up all these distances; that total is the sum of squared
errors (SSE) for that value of k. Plotting SSE against k and looking for the "elbow" where
the curve flattens out is a common way to select k, as sketched below.
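A short sketch of computing this squared error for several values of k, assuming
scikit-learn is available (its KMeans exposes the total squared error as inertia_); the
toy data is invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data

# Fit K-means for a range of k and record the SSE (inertia) for each.
sse = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

for k, err in sse.items():
    print(k, round(err, 1))  # pick the k where the drop in SSE levels off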
1. Drill down: In the drill-down operation, less detailed data is converted into
more detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed
by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: The opposite of drill down; detailed data is aggregated into less
detailed data, either by climbing up the concept hierarchy (Month -> Quarter) or
by removing a dimension.
3. Dice: It selects a sub-cube from the OLAP cube by choosing criteria on two or
more dimensions. In the cube given in the overview section, a sub-cube is
selected using the following criteria (see the pandas sketch after this list):
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in the
creation of a new sub-cube. In the cube given in the overview section, Slice is
performed on the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current
view to get a new view of the representation. In the sub-cube obtained after the
slice operation, performing a pivot operation gives a new view of it.
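These operations can be imitated on a flat table; here is a small pandas sketch
(the toy DataFrame and its numbers are invented for illustration, with dimensions
matching the cube described above):

import pandas as pd

# Toy stand-in for the cube: one row per (Location, Time, Item) cell.
df = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Mumbai"],
    "Time":     ["Q1",    "Q2",    "Q1",      "Q3"],
    "Item":     ["Car",   "Bus",   "Car",     "Car"],
    "Sales":    [120,     80,      95,        60],
})

# Roll up: aggregate away the Item dimension.
rollup = df.groupby(["Location", "Time"])["Sales"].sum()

# Dice: filter on two or more dimensions.
dice = df[df["Location"].isin(["Delhi", "Kolkata"])
          & df["Time"].isin(["Q1", "Q2"])
          & df["Item"].isin(["Car", "Bus"])]

# Slice: fix a single dimension to one value.
slice_q1 = df[df["Time"] == "Q1"]

# Pivot: rotate the view, e.g. Locations as rows and Items as columns.
view = slice_q1.pivot_table(index="Location", columns="Item", values="Sales")
print(view)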
Applications: OLAP is used for business reporting tasks such as sales analysis,
budgeting, financial forecasting, and management reporting.
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds
a class label. The topmost node in the tree is the root node.
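As a quick illustration, assuming scikit-learn is available (the iris dataset is just
a convenient stand-in), fitting a small tree and printing it makes the root test, the
branches, and the leaf class labels visible:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" makes the tree split on information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                              random_state=0).fit(X, y)

# The printed rules show the root node at the top, one indented line per
# branch, and "class: ..." at each leaf.
print(export_text(tree, feature_names=load_iris().feature_names))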
Information gain is the amount of information that is gained by knowing the value of the
attribute: the entropy of the class distribution before the split minus the entropy of
the distribution after it. The largest information gain corresponds to the smallest
entropy left after the split.
Information gain is calculated for a split by subtracting the weighted entropies of each
branch from the original entropy:

Information gain = Entropy(parent) − Σ weight_i × Entropy(branch_i)

where weight_i is the fraction of the parent's samples that falls into branch i.
A dataset with a 50/50 split of samples between the two classes has the maximum entropy
of 1 bit, whereas an imbalanced dataset with a 10/90 split has a smaller entropy (about
0.47 bits), as there is less surprise in a randomly drawn example from the dataset.
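A minimal plain-Python sketch of these entropy and information-gain calculations (the
helper names are ours; the 10/90 case reproduces the figure above):

from math import log2

def entropy(probs):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit   (balanced 50/50 split)
print(entropy([0.1, 0.9]))  # ~0.469 bits (imbalanced 10/90 split)

def information_gain(parent_probs, branches):
    # branches: list of (weight, class-probabilities) pairs, one per child.
    weighted = sum(w * entropy(p) for w, p in branches)
    return entropy(parent_probs) - weighted

# Example: a 50/50 parent split into two pure branches gains the full 1 bit.
print(information_gain([0.5, 0.5], [(0.5, [1.0]), (0.5, [1.0])]))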
The Apriori algorithm is used for mining frequent itemsets and deriving
association rules from a transactional database. The parameters “support” and
“confidence” are used: support is the fraction of transactions in which an
itemset occurs, and confidence(X -> Y) is the conditional probability that a
transaction containing X also contains Y.
Examples: if {bread, butter} appears in 20 of 100 transactions, its support is
0.2; if 25 transactions contain bread and 20 of those also contain butter, then
confidence(bread -> butter) = 20/25 = 0.8.
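A tiny plain-Python sketch of these two measures over an invented toy transaction
list:

# Toy transactional database (invented for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Conditional probability: P(rhs in transaction | lhs in transaction).
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"bread", "butter"}))       # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}))  # 3/4 = 0.75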
ADVANTAGES:
1. Easy to implement
DISADVANTAGES:
1. Very slow on large databases, as it generates a large number of candidate
itemsets and must scan the data repeatedly.