MCQ Amt 1
1: What is the primary goal of data mining according to the most commonly accepted definition?
2: In the early days, what was the derogatory term used to describe attempts to extract information
from data that was not supported by the data itself?
A) Data Modeling
B) Data Dredging
D) Data Extraction
3: Which field initially used the term "data mining," and how was it originally perceived?
4: Which approach to data mining involves the construction of a statistical model by determining the
underlying distribution from which visible data is drawn?
A) Machine Learning
D) Data Dredging
5: In which situation is machine learning typically considered a good approach in data mining?
7: What statistical principle helps avoid treating random occurrences as real when searching for
events within data?
A) Occam's Razor
B) Bonferroni's Principle
C) Bayes' Theorem
8: In data mining, what does the PageRank represent for a web page?
C) The probability that a random walker on the web graph would be on that page.
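The random-walker reading in option C can be made concrete with a small simulation. The following is a minimal sketch only: the three-page graph is hypothetical, every page is assumed to have at least one out-link (no dead ends), and no taxation or teleport step is applied.

```python
def pagerank(links, iterations=100):
    """Approximate the random-walker distribution by power iteration.

    links maps each page to the list of pages it links to. This sketch
    assumes every page has at least one out-link and applies no
    taxation/teleport step.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # walker starts anywhere
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for page, outs in links.items():
            share = rank[page] / len(outs)  # walker picks an out-link uniformly
            for dest in outs:
                nxt[dest] += share
        rank = nxt
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph the values settle so that A and C each hold twice the probability of B, and the three values sum to 1, as a probability distribution should.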
11: What is a significant challenge in data mining when dealing with massive amounts of data?
12: In the example involving the search for terrorists, what does Bonferroni's principle suggest about
the events being detected?
C) They are likely to be false positives and need to be treated with caution.
13: What is the expected number of events that appear to be evil-doing, according to the example
provided in the text?
A) 10,000
B) 100,000
C) 1,000,000
D) 250,000
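The kind of arithmetic behind such an estimate can be sketched as follows. The parameters are assumptions, not stated in this quiz: they follow the commonly cited hotel-visit version of the example (pairs of people who happen to be in the same hotel on two different days), and every figure below is illustrative.

```python
from math import comb

# All four figures are assumed for illustration.
people = 10**9    # people being tracked
days = 1000       # days of observation
hotels = 10**5    # number of hotels
p_visit = 0.01    # chance a given person is in some hotel on a given day

# Probability that two specific people are in the same hotel on one given day.
p_same_hotel_one_day = p_visit * p_visit / hotels
# Probability that this happens on two specific, different days.
p_two_days = p_same_hotel_one_day ** 2

# Expected number of (pair of people, pair of days) coincidences under pure chance.
expected_suspicious_pairs = comb(people, 2) * comb(days, 2) * p_two_days
print(round(expected_suspicious_pairs))
```

With these assumed figures the expectation is roughly a quarter of a million pairs, even though nobody in the model is doing anything wrong. That is the situation Bonferroni's principle warns about.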
14: In the context of data mining, what is the primary goal of feature extraction from large-scale
data?
16: What is the primary purpose of Bonferroni's Principle when applied in data mining?
17: What does the TF.IDF measure aim to identify in text documents for categorization purposes?
18: Why is it generally preferred to choose a prime number as the number of buckets (B) when
designing a hash function for hash tables?
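A small experiment illustrates the effect the question is about. This is a sketch under assumptions: the hash function is taken to be h(x) = x mod B, and the key population (all multiples of 10) is a hypothetical skewed input chosen to share a factor with one of the bucket counts.

```python
from collections import Counter

def bucket_counts(keys, num_buckets):
    """Count how many keys land in each bucket under h(x) = x mod B."""
    counts = Counter(key % num_buckets for key in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Hypothetical skewed key population: every key is a multiple of 10.
keys = [10 * i for i in range(1, 111)]

composite = bucket_counts(keys, 10)  # B = 10 shares the factor 10 with every key
prime = bucket_counts(keys, 11)      # B = 11 shares no factor with the keys' stride

print(composite)
print(prime)
```

With B = 10 every key falls into bucket 0, while with B = 11 the same keys spread evenly across all buckets.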
19: In the context of indexes, what is the primary purpose of an index data structure?
21: What type of data structure is commonly used to build indexes, making it efficient to retrieve
objects based on one or more elements of those objects?
A) Queue
B) Stack
C) Hash Table
D) Linked List
22: In the TF.IDF measure, what does the term "TF" stand for?
A) Total Frequency
B) Term Frequency
C) Text Frequency
D) Token Frequency
23: Why is it important to choose the number of buckets (B) as a prime number in hash functions
when dealing with non-integer data types like strings?
24: What is the primary purpose of the IDF (Inverse Document Frequency) component in the TF.IDF
measure?
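For reference, one common formulation of the measure can be sketched in a few lines. The assumptions here are: TF is a term's count divided by the count of the most frequent term in the same document, IDF is log2(N / n_i) with n_i the number of documents containing the term, and tokenization is naive whitespace splitting. The sample documents are made up.

```python
from collections import Counter
from math import log2

def tf_idf(documents):
    """Return one {term: score} dict per document.

    TF is the term count divided by the count of the most frequent term
    in the same document; IDF is log2(N / n_i), where n_i is the number
    of documents containing the term.
    """
    n_docs = len(documents)
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Each document contributes at most 1 to a term's document frequency.
    doc_freq = Counter(term for c in counts for term in c)
    scores = []
    for c in counts:
        max_count = max(c.values())
        scores.append({
            term: (freq / max_count) * log2(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return scores

# Made-up corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
    "the market rallied",
]
scores = tf_idf(docs)
```

In this corpus "the" occurs in every document, so its IDF and therefore its score is zero, while a word confined to one document, such as "mat", scores highest there.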
26: What is one common way to implement an index data structure using a hash function?
A) Low dimensionality
B) High-dimensional vectors
2: Which of the following is not a requirement for a function to be considered a valid distance
measure?
A) Nonnegativity
B) Symmetry
C) Triangle inequality
D) Asymmetry
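The properties named in the options can be tested mechanically on sample points. The helper below is a hypothetical sketch, not a proof technique: passing on a finite sample does not establish that a function is a distance measure, but a single failure shows that it is not.

```python
from itertools import product
from math import dist

def check_distance_axioms(d, points, tol=1e-9):
    """Return the names of the properties that d violates on the sample points."""
    violated = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            violated.add("nonnegativity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(points, repeat=3):
        # Going directly from x to z must never cost more than going via y.
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle inequality")
    return violated

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]

def squared_euclidean(x, y):
    return dist(x, y) ** 2

euclidean_violations = check_distance_axioms(dist, points)
squared_violations = check_distance_axioms(squared_euclidean, points)
```

On these points the Euclidean distance passes every check, while the squared Euclidean distance fails the triangle inequality: going from (0, 0) to (3, 4) costs 25 directly but only 2 + 13 = 15 via (1, 1).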
5: In a high-dimensional Euclidean space, what is the expected cosine of the angle between random
vectors?
A) Close to 0
B) Close to 90 degrees
C) Close to 1
A) Hierarchical clustering
B) Agglomerative clustering
C) K-means clustering
D) Density-based clustering
7: What is the key distinction between clustering in a Euclidean space and clustering in a
non-Euclidean space?
8: In hierarchical clustering, which of the following distance measures is used to determine the
distance between two clusters?
B) Minimum distance between any two points, one from each cluster
C) Maximum distance between all the points and the centroid of the cluster
A) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster.
C) Stop when the density of the cluster that results from the best merger exceeds a threshold.
D) Stop only when we have a predetermined number of clusters, regardless of the data.
10: Which of the following statements about hierarchical clustering in a Euclidean space is correct?
B) The best merging strategy is always to combine clusters with the largest radius.
C) The diameter of a cluster is the maximum distance between any two points within the cluster.
D) It cannot be used for datasets with a large number of points due to its cubic time complexity.
11: What is the key distinction between hierarchical clustering in a Euclidean space and clustering in
a non-Euclidean space?
A) Euclidean space allows for summarizing clusters by their centroids, while non-Euclidean spaces
do not.
D) Non-Euclidean spaces have a higher time complexity than Euclidean spaces for clustering.
12: Which of the following options is NOT a valid rule for stopping hierarchical clustering?
A) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster.
13: What is the time complexity of the efficient implementation of hierarchical clustering, which uses
a priority queue?
A) O(n)
B) O(n log n)
C) O(n^2)
D) O(n^3)
C) The average distance of all pairs of points, one from each cluster.
D) The maximum distance between all the points and the centroid of the cluster.
C) The point in a cluster that represents the average distance between all points.
D) The point in a cluster that represents the center and minimizes some distance criterion.
16: When using edit distance as a distance measure in non-Euclidean hierarchical clustering, how is
the clustroid typically selected?
B) The point that minimizes the sum of distances to other points in the cluster.
C) The point that maximizes the distance to another point in the cluster.
D) The point that minimizes the maximum distance to other points in the cluster.
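The selection criteria named in these options can be compared on a toy cluster of strings. This is a sketch under assumptions: the edit distance used is the standard Levenshtein distance (insert, delete, substitute), which may differ from the variant the quiz has in mind, and the word list and the `clustroid` helper are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def clustroid(cluster, distance, aggregate=sum):
    """Pick the member whose aggregated distance to the other members is smallest."""
    return min(
        cluster,
        key=lambda p: aggregate(distance(p, q) for q in cluster if q != p),
    )

# Made-up cluster of strings.
words = ["abcd", "abce", "abxd", "zbcd", "abcdefgh"]
by_sum = clustroid(words, edit_distance)       # minimize the sum of distances
by_max = clustroid(words, edit_distance, max)  # minimize the maximum distance
```

Because strings cannot be averaged, the representative has to be an actual member of the cluster. Passing a different `aggregate` function switches between the criteria without changing the rest of the code.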
17: Which of the following options remains a valid criterion for merging clusters in hierarchical
clustering when working in a non-Euclidean space?
B) Merging clusters with the smallest average distance between a point and the clustroid.
18: In non-Euclidean hierarchical clustering, what is the role of the radius of a cluster?
A) It represents the average distance between all pairs of points in the cluster.
B) It represents the maximum distance between any two points in the cluster.
19: Which of the following is true regarding the stopping criteria for hierarchical clustering in non-
Euclidean spaces?
A) Stopping criteria based on the centroid are not applicable in non-Euclidean spaces.
B) Stopping when there is evidence of a bad cluster is the most commonly used criterion.
20: In the k-means algorithm, what is the role of the initial k points?
22: How can you initialize the clusters for k-means?
C) Choose points that have the largest average distance from each other.
23: How can you determine the right value of k in k-means clustering?
C) Run the algorithm for various values of k and choose the one with the lowest average diameter.
24: What is the key idea behind narrowing down the range for k in k-means clustering?
B) The range should be narrowed until there is too much change in cluster quality.
25: In the k-means algorithm, what is the primary goal when reassigning points to clusters?
BFR
1: What is a key assumption of the BFR (Bradley, Fayyad, and Reina) algorithm for clustering in
high-dimensional Euclidean spaces?
C. Mahalanobis distance
4: How does the BFR Algorithm handle points that are not close to any centroid or cluster?
CURE
1: In the CURE (Clustering Using REpresentatives) algorithm, what is the primary difference in cluster
representation compared to traditional centroid-based clustering?
2: How is the initialization phase of the CURE algorithm typically started?
A. Selecting centroids
4: In the CURE algorithm, how are clusters merged during the completion phase?
6: What does the GRGPF algorithm primarily focus on in its clustering approach?
7: In the GRGPF algorithm, what features are included in the representation of a cluster in main
memory?
8: How does the GRGPF algorithm initialize the cluster tree?
D. The cluster is discarded, and its points are assigned to other clusters
10: How does the GRGPF algorithm handle the merging of clusters?
13: In the stream-clustering algorithm presented in Section 7.6.2, what is the primary purpose of the
Merge Buckets step?
14: In a parallel environment, how is the merging of clusters typically handled when using the
map-reduce strategy for clustering a large collection of points?
A. Each Map task merges its clusters before sending them to the Reduce task
15: In the stream-computing model, what is the sliding window used for?
A. To partition clusters
16: Which of the following is NOT a parameter commonly included in the representation of a cluster
in the GRGPF Algorithm?
A. Bucket size is twice the sizes of the two buckets being merged
18: In a parallel environment using the map-reduce strategy for clustering, what is the primary
responsibility of the Map tasks?
19: Which of the following strategies is typically used to decide which clusters to merge when
answering queries in a stream-clustering algorithm?
20: In the context of the GRGPF Algorithm, what is the role of the k points that are furthest from the
clustroid of a cluster?
C. They help determine whether two clusters are close enough to merge.
D. They ensure that the query result covers exactly the last m points.
22: In the BDMO Algorithm for stream clustering, what is the purpose of bucket sizes being
nondecreasing as we go back in time?
23: In the parallel clustering approach discussed for a computing cluster, what is the main role of the
Map tasks?
24: When choosing which clusters to merge in a computing cluster's Reduce task, what is one of the
strategies discussed?