
Questions (Clustering & its Evaluation)

Which clustering algorithm is known for its ability to handle non-linearly separable data?

Which clustering algorithm is based on the concept of centroids and aims to minimize the sum of squared distances between data points and centroids?

Which clustering algorithm does not require the number of clusters as an input parameter?

Which clustering algorithm forms clusters by merging the closest instances or clusters?
Which clustering algorithm uses a hierarchical approach, where clusters are iteratively merged or divided?

Which clustering algorithm is sensitive to the initial placement of centroids and may converge to local optima?

Which evaluation metric measures the compactness of clusters and the separation between different clusters?

Which evaluation metric measures the size of clusters against the average distance between clusters?

Which evaluation metric measures the ratio of between-cluster dispersion to within-cluster dispersion?

Which evaluation metric measures the between-cluster distance against the within-cluster distance?

Questions (Power BI)

Which Power BI visualization is best suited to show date trends?


Which visualization is suitable for comparing counts by category?

Which visualization can show the count of events by hour of the day?

Which visual is appropriate to display the count of sales by month?

Which visual is suitable for displaying the count of products by region?

Which visualization is commonly used for displaying cumulative counts?


Which visual can be used to compare the count of tickets by priority?
Answer Explanation

Answer: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm

Explanation: DBSCAN is particularly effective when dealing with data that contains clusters of different shapes and sizes, although a single density threshold can struggle when cluster densities differ widely. Unlike K-means, DBSCAN does not require the number of clusters to be specified in advance. It can identify clusters of arbitrary shape and can handle noise points as well.
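The density-based idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `dbscan`, `region_query`, `eps`, and `min_pts` are chosen here for the example, and the ring-around-a-blob data is made up to show a case that no straight line can separate.

```python
import math

NOISE = -1  # label for points that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: a ring of radius 5 around a small blob, plus one outlier.
ring = [(5 * math.cos(2 * math.pi * t / 24), 5 * math.sin(2 * math.pi * t / 24))
        for t in range(24)]
blob = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
outlier = [(20.0, 20.0)]
labels = dbscan(ring + blob + outlier, eps=1.5, min_pts=3)
```

With these assumed settings the ring and the blob come out as two separate clusters and the outlier is labeled as noise, without the number of clusters ever being passed in.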

Answer: K-means clustering

Explanation: K-means clustering is an iterative algorithm that partitions a dataset into K clusters, where K is a user-defined parameter specifying the desired number of clusters. The algorithm works by iteratively assigning each data point to the nearest centroid and then updating each centroid to the mean of its assigned points. The goal is to minimize the total within-cluster sum of squared distances, also known as the inertia or distortion.
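The assign-and-update loop described above can be sketched in plain Python. This is an illustrative sketch only: the function names, the random choice of initial centroids from the data, and the sample points are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))

def k_means(points, k, max_iter=100, seed=0):
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids, inertia = k_means(points, k=2)
```

The returned `inertia` is the within-cluster sum of squared distances that the algorithm tries to minimize.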

Answer: The clustering algorithm that does not require the number of clusters as an input parameter is Hierarchical Clustering.

Explanation:

Agglomerative Hierarchical Clustering: This bottom-up approach starts with each data point as an individual cluster and iteratively merges the closest pairs of clusters based on a similarity or distance measure. The merging continues until all data points belong to a single cluster, resulting in a dendrogram. The number of clusters is determined by choosing a threshold to cut the dendrogram, forming distinct clusters at the desired level of similarity.

Divisive Hierarchical Clustering: This top-down approach begins with all data points in a single cluster and recursively divides clusters into smaller subclusters. At each step, the algorithm selects a cluster and splits it into two based on a similarity or distance measure. The division continues until each data point is in its own cluster, resulting in a dendrogram. Similarly to agglomerative clustering, the number of clusters is determined by choosing a threshold to cut the dendrogram.

Answer: Agglomerative Hierarchical Clustering

Explanation: Agglomerative Hierarchical Clustering is a bottom-up approach to clustering that starts with each data point as an individual cluster and iteratively merges the closest pairs of clusters based on a similarity or distance measure. This process continues until all data points belong to a single cluster, forming a hierarchical structure known as a dendrogram.
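A small sketch of the bottom-up merging, written in plain Python. It assumes single linkage (the distance between two clusters is the distance between their closest members) and stops merging once the closest pair is farther apart than a chosen threshold, which plays the role of cutting the dendrogram. The function names and sample points are made up for illustration.

```python
import math

def single_linkage(a, b, points):
    """Smallest distance between any member of cluster a and any member of cluster b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerative(points, distance_threshold):
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster
    merges = []  # record of (cluster_a, cluster_b, distance), i.e. the dendrogram steps
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > distance_threshold:
            break  # "cut" the dendrogram at this height
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)] + [merged]
    return clusters, merges

# Hypothetical data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
clusters, merges = agglomerative(points, distance_threshold=2.0)
```

Raising `distance_threshold` lets the merging continue further up the hierarchy, so the same procedure yields fewer, larger clusters.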
Answer: Hierarchical Clustering

Explanation: Hierarchical Clustering builds a hierarchy of clusters by iteratively merging or dividing clusters based on a similarity or distance measure. It creates a dendrogram, which is a tree-like structure that represents the relationships between data points or clusters at different levels of similarity.

Answer: The clustering algorithm that is sensitive to the initial placement of centroids and may converge to local optima is K-means clustering.

Explanation: K-means clustering is an iterative algorithm that aims to partition a dataset into K clusters by minimizing the sum of squared distances between points and centroids. However, K-means clustering does not guarantee finding the global optimum and is prone to converging to local optima, so the final clusters depend on where the initial centroids are placed.
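The effect of initialization can be seen on a tiny, made-up example: four points at the corners of a 4-by-1 rectangle. The sketch below runs the standard assign-and-update iteration from two hand-picked starting positions; the function name `lloyd` and both sets of starting centroids are assumptions chosen for the illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centroids, max_iter=100):
    """Run K-means iterations from the given initial centroids."""
    k = len(centroids)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
              for p in points]
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, inertia

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start near the left and right pairs: converges to the better split.
_, good_inertia = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])

# Start between the pairs: the algorithm is already at a fixed point
# that groups top and bottom points instead, with a much larger inertia.
_, bad_inertia = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])
```

Here `good_inertia` is 1.0 and `bad_inertia` is 16.0, even though both runs have converged. Running K-means several times from different starting points and keeping the lowest-inertia result is a common way to reduce this risk.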

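Answer: The evaluation metric that measures the compactness of clusters and the separation between different clusters is called the Silhouette Score.

Explanation: The Silhouette Score is a widely used evaluation metric for clustering algorithms. It provides a measure of how well-separated clusters are and how tightly grouped the data points are within each cluster.

The Silhouette Score for an individual data point i is calculated using the following formula:

S(i) = (b(i) - a(i)) / max(a(i), b(i))

Here a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points in the nearest other cluster. S(i) ranges from -1 to 1, and values close to 1 indicate that the point sits well inside its cluster.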
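The formula can be evaluated directly. The sketch below assumes Euclidean distance, takes a(i) as the mean distance to the other points in the same cluster and b(i) as the mean distance to the points of the nearest other cluster, and assigns 0 to points in single-member clusters (a common convention). The function name and the sample data are illustrative.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette values S(i); needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for a single-member cluster
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        values.append((b - a) / max(a, b))
    return values

# Hypothetical data: two tight pairs far apart on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
mean_silhouette = sum(values) / len(values)
```

For the first point, a = 1 and b = 10.5, so S = 9.5 / 10.5, roughly 0.90. The overall score is the mean of the per-point values.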

Answer: The evaluation metric that measures the size of clusters against the average distance between clusters is called the Davies-Bouldin Index.

Explanation: The Davies-Bouldin Index is a clustering evaluation metric that quantifies the quality of clustering by considering both the compactness of clusters and the separation between clusters. It compares the average distance of points to their own cluster centroid with the distance between cluster centroids. Lower values indicate more compact, better-separated clusters.
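A minimal sketch of the index in plain Python, assuming the usual definition: for each cluster, take the worst-case ratio of (its scatter plus another cluster's scatter) to the distance between their centroids, then average over clusters. Scatter here is the mean Euclidean distance of a cluster's points to its centroid. The function names and data are made up for the example.

```python
import math

def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def davies_bouldin(clusters):
    """clusters: list of lists of points; needs at least two clusters."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its own centroid.
    scatter = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Hypothetical data: two tight pairs far apart on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
db_index = davies_bouldin(clusters)
```

Each cluster here has scatter 0.5 and the centroids are 10 apart, so the index is (0.5 + 0.5) / 10 = 0.1.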

Answer: The evaluation metric that measures the ratio of between-cluster dispersion to within-cluster dispersion is called the Dunn Index.

Explanation: The Dunn Index is a clustering evaluation metric that quantifies the compactness of clusters and the separation between clusters. It calculates the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher values indicate more compact, better-separated clusters.
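The ratio can be computed directly from pairwise distances. Several variants of the Dunn Index exist; this sketch assumes one common choice, where the inter-cluster distance is the distance between the closest points of two clusters and the intra-cluster distance is a cluster's diameter (its largest pairwise distance). The function name and data are illustrative.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of lists of points; at least one cluster needs two or more points."""
    # Minimum distance between points that belong to different clusters.
    min_between = min(math.dist(p, q)
                      for a, b in combinations(clusters, 2)
                      for p in a for q in b)
    # Maximum distance between points that belong to the same cluster.
    max_within = max(math.dist(p, q)
                     for c in clusters
                     for p, q in combinations(c, 2))
    return min_between / max_within

# Hypothetical data: two tight pairs far apart on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
dunn = dunn_index(clusters)
```

The closest points of the two clusters are 9 apart and each cluster has diameter 1, so the index is 9.0.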

Answer: The evaluation metric that calculates the ratio of between-cluster distance to within-cluster distance is called the Silhouette Score.

Explanation: The Silhouette Score is a widely used clustering evaluation metric that measures the compactness of clusters and the separation between different clusters. It provides an indication of how well-separated clusters are and how tightly grouped the data points are within each cluster.

Answer: Line Chart

Explanation: The Line Chart is a versatile visualization for showing trends over time. It displays data points connected by lines, making it effective for visualizing changes in values across different dates. The x-axis can represent the date or time, while the y-axis represents the corresponding data values.
Answer: Bar Chart

Explanation: The Bar Chart is an effective visualization for comparing and displaying numerical values or counts across different categories. It presents data using horizontal or vertical bars, with the length of each bar representing the magnitude of the data.

Answer: Line Chart

Explanation: The Line Chart is a suitable visualization for displaying trends over time, making it effective for showcasing the count of events by hour of the day. To visualize event counts by hour, you would typically place the hour of the day on the x-axis (representing time) and the count of events on the y-axis.
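Before charting, the events have to be aggregated to one count per hour. The sketch below shows that aggregation step in plain Python as an illustration of the data shape the chart expects; the timestamps and their format are made up, and in Power BI the same grouping would normally be done inside the tool itself.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps.
timestamps = [
    "2024-03-01 09:15",
    "2024-03-01 09:40",
    "2024-03-01 13:05",
    "2024-03-02 09:59",
    "2024-03-02 17:30",
]

# Count events per hour of the day (0-23), ignoring the date part.
counts = Counter(datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)

# One (hour, count) pair per hour, including hours with no events,
# so the x-axis runs continuously from 0 to 23.
series = [(hour, counts.get(hour, 0)) for hour in range(24)]
```

Each pair in `series` corresponds to one point on the line: the hour goes on the x-axis and the count on the y-axis.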

Answer: Bar chart or column chart

Explanation: In a bar or column chart, each month is represented by a bar or column, and the length or height of the bar corresponds to the count of sales for that particular month. In a column chart, the x-axis typically represents the months, while the y-axis represents the count of sales.

Answer: Stacked bar chart

Explanation: In a stacked bar chart, each region is represented by a bar whose total length corresponds to the count of products in that region. Each bar is divided into segments, one per sub-category (for example, product type), and the length of each segment represents the count for that sub-category within the region. This type of chart helps visualize the total count of products in each region and also shows how products are distributed across regions.

Answer: Cumulative line chart or step chart

Explanation:

Cumulative Line Chart: In a cumulative line chart, the cumulative count is plotted over time or any other ordered dimension. The x-axis typically represents the time or the ordered dimension, while the y-axis represents the cumulative count. The line graph shows the cumulative count increasing over time, with each data point representing the cumulative count at a specific point in time. The line connects these data points, illustrating the progression of the cumulative count.

Step Chart: A step chart is another common visualization for displaying cumulative counts. In a step chart, the cumulative count is represented by a series of horizontal and vertical lines. Each step represents an increase in the cumulative count at a specific point in time or in the ordered dimension. The horizontal segments of the step chart show a constant value until a new data point is reached, where the vertical segment represents the increase in the cumulative count. This type of chart emphasizes the discrete changes in the cumulative count over time.
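The difference between the two charts is easiest to see in the numbers behind them. The sketch below is an illustration in plain Python with made-up daily counts: it builds the running total that a cumulative line chart plots, and then the extra corner points that a step chart adds so that the value stays flat until the next data point. The helper name `step_vertices` is hypothetical.

```python
from itertools import accumulate

# Hypothetical per-day counts for days 1 to 4.
days = [1, 2, 3, 4]
daily_counts = [3, 0, 5, 2]

# Running total: the y-values of a cumulative line chart.
cumulative = list(accumulate(daily_counts))

def step_vertices(xs, ys):
    """Corner points of a step chart: flat until the next x, then a vertical jump."""
    vertices = [(xs[0], ys[0])]
    for i in range(1, len(xs)):
        vertices.append((xs[i], ys[i - 1]))  # end of the horizontal segment
        vertices.append((xs[i], ys[i]))      # top of the vertical segment
    return vertices

steps = step_vertices(days, cumulative)
```

Here `cumulative` is `[3, 3, 8, 10]`. A line chart would connect (2, 3) to (3, 8) with a sloped segment, while the step chart goes through (3, 3) first, which makes the jump on day 3 explicit.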
Answer: Horizontal bar chart

Explanation: In a horizontal bar chart, each priority category is represented by a separate horizontal bar. The length of each bar corresponds to the count of tickets for that particular priority. The y-axis typically represents the priority categories (e.g., low, medium, high), while the x-axis represents the count of tickets.
