MCQ Amt 1


Data Mining

1: What is the primary goal of data mining according to the most commonly accepted definition?

A) Extracting data that supports statistical models.

B) Constructing underlying distributions from data.

C) Discovering models for data.

D) Identifying hidden patterns in data.

2: In the early days, what was the derogatory term used to describe attempts to extract information from data that was not supported by the data itself?

A) Data Modeling

B) Data Dredging

C) Statistical Analysis

D) Data Extraction

3: Which field initially used the term "data mining," and how was it originally perceived?

A) Computer Science; as a positive approach to discover patterns.

B) Statistics; as a derogatory term for extracting unsupported information.

C) Machine Learning; as a synonym for predictive modeling.

D) Data Analysis; as a descriptive technique for data visualization.

4: Which approach to data mining involves the construction of a statistical model by determining the underlying distribution from which visible data is drawn?

A) Machine Learning

B) Computational Approaches

C) Statistical Modeling

D) Data Dredging

5: In which situation is machine learning typically considered a good approach in data mining?

A) When the goals of mining are well-defined.

B) When the data is generated from a known process.

C) When there is little idea of what to look for in the data.

D) When the data size is small.

6: What is the PageRank concept primarily used for in Web mining?

A) Clustering web pages into categories.

B) Summarizing the entire complex structure of the Web.

C) Analyzing user behavior on web pages.

D) Identifying the most frequently searched keywords.

7: What statistical principle helps avoid treating random occurrences as real when searching for events within data?

A) Occam's Razor

B) Bonferroni's Principle

C) Bayes' Theorem

D) The Law of Large Numbers

8: In data mining, what does the PageRank represent for a web page?

A) The number of clicks the page receives.

B) The number of advertisements on the page.

C) The probability that a random walker on the web graph would be on that page.

D) The page's total word count.
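
As background for question 8, one common way to write the PageRank iteration is shown below; here M is the column-stochastic transition matrix of the Web graph, n is the number of pages, e is a vector of ones, and β is a damping (taxation) parameter often taken near 0.85 (that value is an assumption, not something stated in these questions). The limiting vector v gives, for each page, the probability that a random walker on the Web graph is at that page.

```latex
v^{(t+1)} = \beta M v^{(t)} + (1-\beta)\,\frac{\mathbf{e}}{n}
```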

9: What is the primary goal of clustering in data mining?

A) Summarizing complex data structures.

B) Assigning labels to data points.

C) Identifying outliers in the data.

D) Reducing data dimensionality.


10: How does clustering work in data mining?

A) It assigns a unique label to each data point.

B) It groups similar data points together.

C) It reduces the number of data points in the dataset.

D) It performs regression analysis on the data.

11: What is a significant challenge in data mining when dealing with massive amounts of data?

A) Finding the most common data patterns.

B) Identifying patterns that are statistically significant.

C) Avoiding overfitting the data.

D) Dealing with computational limitations.

12: In the example involving the search for terrorists, what does Bonferroni's principle suggest about
the events being detected?

A) They are likely genuine and require further investigation.

B) They are statistically significant and should be reported.

C) They are likely to be false positives and need to be treated with caution.

D) They are unrelated to the search for terrorists.

13: What is the expected number of events that appear to be evil-doing, according to the example
provided in the text?

A) 10,000

B) 100,000

C) 1,000,000

D) 250,000
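
A worked version of the calculation behind question 13, assuming the standard textbook parameters for this example (10^9 people, 10^5 hotels, each person at some hotel on 1% of days, records covering 1000 days): the chance that a particular pair of people is at the same hotel on a particular day is 0.01 × 0.01 × 10^-5 = 10^-9, so the expected number of pairs of people seen at the same hotel on two different days is

```latex
\binom{10^9}{2}\binom{1000}{2}\left(10^{-9}\right)^2
\approx \left(5\times10^{17}\right)\left(5\times10^{5}\right)\left(10^{-18}\right)
= 250{,}000 .
```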

14: In the context of data mining, what is the primary goal of feature extraction from large-scale data?

A) Reducing the dataset size.

B) Discovering frequent itemsets.

C) Identifying unusual events.

D) Summarizing data by capturing prominent features.


15: What does collaborative filtering aim to accomplish in data mining?

A) Identifying outliers in the data.

B) Recommending items based on similar user preferences.

C) Reducing data dimensionality.

D) Clustering similar data points together.

16: What is the primary purpose of Bonferroni's Principle when applied in data mining?

A) To increase the likelihood of finding genuine patterns.

B) To reduce computational complexity.

C) To avoid treating random occurrences as significant.

D) To establish causation between events.

17: What does the TF.IDF measure aim to identify in text documents for categorization purposes?

A) The most frequent words in a document.

B) The least common words in a document.

C) The concentration of significant words in documents.

D) The total word count in a document.

18: Why is it generally preferred to choose a prime number as the number of buckets (B) when designing a hash function for hash tables?

A) Prime numbers make the hash function simpler to implement.

B) Prime numbers ensure a uniform distribution of hash-keys into buckets.

C) Prime numbers eliminate the need for hash functions.

D) Prime numbers reduce the chance of collisions.

19: In the context of indexes, what is the primary purpose of an index data structure?

A) To store data records in a compact format.

B) To efficiently retrieve objects based on the values of their fields.

C) To sort data records in ascending order.

D) To encrypt the data for security purposes.


20: How can you adapt a hash function to handle non-integer data types, such as strings?

A) Convert each character to its ASCII equivalent and sum them.

B) Use a prime number as the number of buckets.

C) Group characters and convert the groups to integers before summing.

D) Implement a recursive algorithm for converting data types to integers.
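
A minimal sketch of the idea in questions 18, 20, and 23: group the characters of a string, treat each group as an integer, sum the groups, and take the remainder modulo the number of buckets B, which is chosen to be prime. The bucket count 1009 and group size 4 here are arbitrary illustrative choices, not values from the text.

```python
def string_bucket(s: str, B: int = 1009, group: int = 4) -> int:
    """Hash a string into one of B buckets (B assumed prime)."""
    total = 0
    for i in range(0, len(s), group):
        value = 0
        for ch in s[i:i + group]:
            value = value * 256 + ord(ch)   # treat the group of characters as one integer
        total += value                      # sum the group values
    return total % B                        # hash-key = remainder modulo B

print(string_bucket("data mining"))         # some bucket index in [0, 1009)
```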

21: What type of data structure is commonly used to build indexes, making it efficient to retrieve
objects based on one or more elements of those objects?

A) Queue

B) Stack

C) Hash Table

D) Linked List

22: In the TF.IDF measure, what does the term "TF" stand for?

A) Total Frequency

B) Term Frequency

C) Text Frequency

D) Token Frequency

23: Why is it important to choose the number of buckets (B) as a prime number in hash functions when dealing with non-integer data types like strings?

A) To minimize computational complexity.

B) To ensure that hash-keys are unique.

C) To avoid collisions and achieve a more uniform distribution.

D) To improve the efficiency of the hash function.

24: What is the primary purpose of the IDF (Inverse Document Frequency) component in the TF.IDF
measure?

A) To calculate the total word count in a document.

B) To determine the number of documents in the collection.

C) To normalize the term frequency.

D) To evaluate the significance of a term in a document collection.
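
For questions 17, 22, and 24, the TF.IDF score of term i in document j is usually defined as follows, where f_ij is the number of occurrences of term i in document j, N is the number of documents in the collection, and n_i is the number of documents containing term i:

```latex
\mathrm{TF}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad
\mathrm{IDF}_{i} = \log_2\frac{N}{n_i}, \qquad
\mathrm{TF.IDF}_{ij} = \mathrm{TF}_{ij}\times\mathrm{IDF}_{i}
```

Terms with high TF.IDF in a document are concentrated there but rare across the collection, which is why they are useful for categorization.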


25: How does a hash function help in building an index data structure efficiently?

A) It sorts the data records.

B) It encrypts the data for security.

C) It maps values to unique identifiers.

D) It distributes data records uniformly into buckets for quick retrieval.

26: What is one common way to implement an index data structure using a hash function?

A) Sort the data records in ascending order.

B) Encrypt the data records for security.

C) Map data values to their ASCII codes.

D) Assign data records to buckets based on hash-keys.
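
A small sketch of the index idea in questions 19, 21, 25, and 26, not taken from the text: records are dropped into buckets according to the hash-key of a lookup field, so a query examines only one bucket rather than the whole dataset. Python's built-in hash is used here purely for illustration.

```python
from collections import defaultdict

class HashIndex:
    """Bucket records by the hash-key of a lookup field (B assumed prime)."""
    def __init__(self, B: int = 1009):
        self.B = B
        self.buckets = defaultdict(list)

    def insert(self, key, record):
        self.buckets[hash(key) % self.B].append((key, record))

    def lookup(self, key):
        # Only the records in one bucket are scanned, which is what makes retrieval fast.
        return [rec for k, rec in self.buckets[hash(key) % self.B] if k == key]

idx = HashIndex()
idx.insert("Alice", {"phone": "555-0100"})
print(idx.lookup("Alice"))
```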


Clustering

1: What is a common property of Euclidean spaces that is useful for clustering?

A) Low dimensionality

B) High-dimensional vectors

C) Non-Euclidean distance measures

D) Random distribution of points

2: Which of the following is not a requirement for a function to be considered a valid distance measure?

A) Nonnegativity

B) Symmetry

C) Triangle inequality

D) Asymmetry

3: In the hierarchical clustering approach, when do clusters stop being combined?

A) When clusters have a predetermined number of points.

B) When clusters have points spread out over a large region.

C) When there are no more points left to cluster.

D) When clusters have the same centroid.

4: What is the primary challenge posed by high-dimensional spaces in clustering?

A) A large number of possible clusters

B) Difficulty in finding any clusters

C) Almost all pairs of points are equally far away

D) Inability to use Euclidean distance

5: In a high-dimensional Euclidean space, what is the expected cosine of the angle between random
vectors?

A) Close to 0

B) Close to 90 degrees

C) Close to 1

D) Close to 180 degrees
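
A rough argument for question 5: for two random vectors x and y in d dimensions (components drawn independently, say from {-1, +1}), the dot product in the numerator has magnitude on the order of sqrt(d), while each norm is about sqrt(d), so

```latex
\cos\theta = \frac{x\cdot y}{\lVert x\rVert\,\lVert y\rVert}
\approx \frac{O(\sqrt{d})}{\sqrt{d}\,\sqrt{d}}
= O\!\left(\frac{1}{\sqrt{d}}\right) \longrightarrow 0 \quad \text{as } d \to \infty ,
```

which is why almost all pairs of random vectors in a high-dimensional space are nearly orthogonal.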


6: In which clustering approach are clusters initially formed by assigning each point to the cluster into which it best fits?

A) Hierarchical clustering

B) Agglomerative clustering

C) K-means clustering

D) Density-based clustering

7: What is the key distinction between clustering in a Euclidean space and clustering in a non-Euclidean space?

A) In a Euclidean space, the points are vectors of real numbers.

B) In a non-Euclidean space, clustering is not possible.

C) In a non-Euclidean space, the curse of dimensionality is not applicable.

D) In a Euclidean space, points are assigned to clusters based on their angles.

8: In hierarchical clustering, which of the following distance measures is used to determine the
distance between two clusters?

A) Average distance of all pairs of points, one from each cluster

B) Minimum distance between any two points, one from each cluster

C) Maximum distance between all the points and the centroid of the cluster

D) Diameter of the cluster

9: What is a common rule for stopping hierarchical clustering?

A) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster.

B) Always stop when there is only one cluster left.

C) Stop when the density of the cluster that results from the best merger exceeds a threshold.

D) Stop only when we have a predetermined number of clusters, regardless of the data.

10: Which of the following statements about hierarchical clustering in a Euclidean space is correct?

A) It has a linear time complexity for any number of data points.

B) The best merging strategy is always to combine clusters with the largest radius.

C) The diameter of a cluster is the maximum distance between any two points within the cluster.

D) It cannot be used for datasets with a large number of points due to its cubic time complexity.

11: What is the key distinction between hierarchical clustering in a Euclidean space and clustering in a non-Euclidean space?

A) Euclidean space allows for summarizing clusters by their centroids, while non-Euclidean spaces
do not.

B) Clustering in non-Euclidean spaces is always more efficient.

C) Euclidean space has a linear time complexity for hierarchical clustering.

D) Non-Euclidean spaces have a higher time complexity than Euclidean spaces for clustering.

12: Which of the following options is NOT a valid rule for stopping hierarchical clustering?

A) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster.

B) Stop when there is only one cluster left.

C) Stop when the diameter of the resulting cluster exceeds a threshold.

D) Stop when the density of the resulting cluster is below a threshold.

13: What is the time complexity of the efficient implementation of hierarchical clustering, which uses a priority queue?

A) O(n)

B) O(n log n)

C) O(n^2)

D) O(n^3)

14: In hierarchical clustering, what does the radius of a cluster represent?

A) The maximum distance between any two points in the cluster.

B) The minimum distance between any two points in the cluster.

C) The average distance of all pairs of points, one from each cluster.

D) The maximum distance between all the points and the centroid of the cluster.

15: In hierarchical clustering in non-Euclidean spaces, what is a clustroid?

A) The point in a cluster that is farthest from the centroid.

B) The point in a cluster that is closest to the centroid.

C) The point in a cluster that represents the average distance between all points.

D) The point in a cluster that represents the center and minimizes some distance criterion.

16: When using edit distance as a distance measure in non-Euclidean hierarchical clustering, how is the clustroid typically selected?

A) The point that is farthest from the centroid.

B) The point that minimizes the sum of distances to other points in the cluster.

C) The point that maximizes the distance to another point in the cluster.

D) The point that minimizes the maximum distance to other points in the cluster.
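
A minimal sketch (not from the text) of the clustroid choice in questions 15 and 16: pick the cluster member that minimizes the sum of distances to the other members. Any valid distance measure can be plugged in; the simple mismatch count below is only a stand-in for edit distance, kept short for illustration.

```python
def clustroid(points, dist):
    """Return the cluster member minimizing the sum of distances to the others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Illustrative use with equal-length strings and a per-position mismatch count:
words = ["abcd", "abce", "abxd", "zzzz"]
print(clustroid(words, lambda a, b: sum(x != y for x, y in zip(a, b))))
```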

17: Which of the following options remains a valid criterion for merging clusters in hierarchical clustering when working in a non-Euclidean space?

A) Merging clusters with the smallest diameter.

B) Merging clusters with the smallest average distance between a point and the clustroid.

C) Merging clusters with the smallest centroid distance.

D) Merging clusters with the largest radius.

18: In non-Euclidean hierarchical clustering, what is the role of the radius of a cluster?

A) It represents the average distance between all pairs of points in the cluster.

B) It represents the maximum distance between any two points in the cluster.

C) It defines the minimum distance criterion for merging clusters.

D) It measures the distance between the clustroid and the centroid.

19: Which of the following is true regarding the stopping criteria for hierarchical clustering in non-
Euclidean spaces?

A) Stopping criteria based on the centroid are not applicable in non-Euclidean spaces.

B) Stopping when there is evidence of a bad cluster is the most commonly used criterion.

C) Stopping when there is only one cluster left is universally recommended.

D) Stopping based on density criteria is not feasible in non-Euclidean spaces.

20: In the k-means algorithm, what is the role of the initial k points?

A) They represent the centroids of the clusters.

B) They represent the average points of the clusters.

C) They represent the furthest points from each other.

D) They represent the medians of the clusters.


21: What is the core operation in the k-means algorithm that assigns each point to the nearest cluster?

A) Updating the centroids.

B) Calculating the average distances.

C) Finding the furthest point.

D) Reassigning points based on distance.

22: How can you initialize the clusters for k-means?

A) Randomly select k points from the dataset.

B) Cluster a sample of the data hierarchically.

C) Choose points that have the largest average distance from each other.

D) All of the above.

23: How can you determine the right value of k in k-means clustering?

A) Use trial and error until the best k is found.

B) Always start with k = 2 and increment gradually until a good value is found.

C) Run the algorithm for various values of k and choose the one with the lowest average diameter.

D) The true value of k is usually known in advance.

24: What is the key idea behind narrowing down the range for k in k-means clustering?

A) The range should start with a large value of k.

B) The range should be narrowed until there is too much change in cluster quality.

C) The range should be narrowed to the smallest possible value of k.

D) The range should always be narrowed by half.

25: In the k-means algorithm, what is the primary goal when reassigning points to clusters?

A) Minimizing the maximum distance between points in a cluster.

B) Minimizing the sum of squared distances between points in a cluster.

C) Maximizing the average distance between points in a cluster.

D) Maximizing the number of clusters.
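
A compact k-means sketch (an illustration, not the text's own code) showing the loop implied by questions 20, 21, and 25: pick k initial points as centroids, assign each point to the nearest centroid, then recompute centroids; over iterations this drives down the sum of squared distances within clusters.

```python
import random

def kmeans(points, k, iters=50):
    centroids = random.sample(points, k)                  # one simple initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assignment step
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):                 # centroid update step
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (0.8, 1.1)]
print(kmeans(pts, 2)[0])
```

The other initializations mentioned in question 22 (hierarchically clustering a sample, or picking points that are far from one another) would simply replace the random.sample call above.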


26: When using k-means to find the right value of k, why is it important to choose a range for k that
includes the true number of clusters?

A) To reduce the computational complexity.

B) To ensure that k-means always converges to the correct solution.

C) To avoid excessive clustering iterations.

D) To detect when the quality of clustering starts to degrade.

BFR

1: What is a key assump on of the BFR (Bradley, Fayyad, and Reina) algorithm for clustering in high-
dimensional Euclidean spaces?

A. Clusters are arbitrary in shape

B. Clusters must be normally distributed

C. Clusters can have any orientation

D. Clusters are widely separated

2: In the BFR Algorithm, what is the representation used to summarize clusters?

A. N, the count of points

B. Centroid and standard deviation in each dimension

C. Mahalanobis distance

D. Sum of squared components

3: What is the Mahalanobis distance used for in the BFR Algorithm?

A. To compute the Euclidean distance between points

B. To measure the quality of cluster cohesion

C. To normalize the difference between a point and a cluster centroid

D. To calculate the average radius of clusters
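
For question 3, the simplified Mahalanobis distance used by BFR normalizes each coordinate difference between a point p and a cluster centroid c by that cluster's standard deviation in the corresponding dimension:

```latex
d(p, c) = \sqrt{\sum_{i=1}^{d}\left(\frac{p_i - c_i}{\sigma_i}\right)^{2}}
```

A point is typically added to a cluster only if this distance is below some threshold, for example a small number of standard deviations; the exact threshold is a design choice, not fixed by these questions.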

4: How does the BFR Algorithm handle points that are not close to any centroid or cluster?

A. They are added to the retained set

B. They are immediately discarded

C. They are assigned to the cluster with the most points

D. They are stored in a separate file


5: What is the primary advantage of representing sets of points in the BFR Algorithm using N, SUM, and SUMSQ?

A. It reduces the memory requirements

B. It simplifies the calculation of centroids

C. It allows for easy computation of cluster variances

D. It speeds up the clustering process
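
A small sketch (an assumed coding of the idea, not from the text) of why the (N, SUM, SUMSQ) representation in question 5 is convenient: summaries merge by component-wise addition, and the centroid and per-dimension variance can be read off directly.

```python
def merge(s1, s2):
    """Combine two (N, SUM, SUMSQ) cluster summaries by component-wise addition."""
    n1, sum1, sq1 = s1
    n2, sum2, sq2 = s2
    return (n1 + n2,
            [a + b for a, b in zip(sum1, sum2)],
            [a + b for a, b in zip(sq1, sq2)])

def centroid_and_variance(summary):
    n, s, sq = summary
    centroid = [x / n for x in s]                              # SUM_i / N
    variance = [q / n - (x / n) ** 2 for q, x in zip(sq, s)]   # SUMSQ_i/N - (SUM_i/N)^2
    return centroid, variance

a = (2, [4.0, 6.0], [10.0, 20.0])   # hypothetical summary of two points
b = (3, [9.0, 3.0], [29.0, 5.0])    # hypothetical summary of three points
print(centroid_and_variance(merge(a, b)))
```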

CURE

1: In the CURE (Clustering Using REpresentatives) algorithm, what is the primary difference in cluster representation compared to traditional centroid-based clustering?

A. CURE uses Mahalanobis distances

B. CURE uses representative points

C. CURE assumes normally distributed clusters

D. CURE relies on hierarchical clustering

2: How is the initialization phase of the CURE algorithm typically started?

A. Selecting centroids

B. Taking a small sample of data

C. Computing Mahalanobis distances

D. Using hierarchical clustering

3: What is the purpose of moving representative points in the CURE algorithm?

A. To simplify cluster representations

B. To accelerate the clustering process

C. To ensure all representative points are near the centroid

D. To merge clusters more effectively

4: In the CURE algorithm, how are clusters merged during the completion phase?

A. Based on the Euclidean distance between centroids

B. When any two representative points are close

C. When the majority of data points overlap

D. Using a hierarchical clustering approach


5: How does the CURE algorithm assign points to clusters in the final step?

A. By choosing the cluster with the most data points

B. By computing Mahalanobis distances

C. By comparing points to the centroid of each cluster

D. By measuring the quality of cluster cohesion

6: What does the GRGPF algorithm primarily focus on in its clustering approach?

A. Clustering in Euclidean spaces

B. Hierarchical clustering only

C. Non-main-memory data handling

D. Clustering using representative points

7: In the GRGPF algorithm, what features are included in the representation of a cluster in main memory?

A. N, clustroid, and Mahalanobis distances

B. The centroid, variance, and standard deviation

C. N, clustroid, rowsums, and closest/furthest points

D. Centroid, diameter, and standard deviation

8: How does the GRGPF algorithm initialize the cluster tree?

A. By organizing clusters based on their centroid distances

B. By taking a main-memory sample and clustering hierarchically

C. By assigning each point to its nearest cluster

D. By selecting cluster representatives randomly

9: In GRGPF, what happens when a cluster's radius becomes too large?

A. The cluster is immediately merged with its nearest neighbor

B. The cluster's features are recomputed without splitting

C. The cluster is split into two smaller clusters

D. The cluster is discarded, and its points are assigned to other clusters

10: How does the GRGPF algorithm handle the merging of clusters?

A. By always merging clusters with the highest number of points

B. By computing Mahalanobis distances between clusters

C. By considering clusters with common ancestors

D. By only merging clusters that are physically close

13: In the stream-clustering algorithm presented in Section 7.6.2, what is the primary purpose of the Merge Buckets step?

A. To merge stream elements into clusters

B. To combine the sizes of buckets for efficient processing

C. To ensure each bucket represents a fixed number of points

D. To periodically review and update the cluster representations

14: In a parallel environment, how is the merging of clusters typically handled when using the map-
reduce strategy for clustering a large collec on of points?

A. Each Map task merges its clusters before sending them to the Reduce task

B. The Reduce task performs pairwise merges based on cluster sizes

C. Merging is not necessary as each task independently computes centroids

D. The Reduce task merges all clusters using a predefined strategy
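
A rough, self-contained sketch of the map-reduce strategy behind questions 14 and 18, under the assumption that each Map task clusters only its own chunk of points and emits (centroid, size) pairs, while the Reduce task merges pairs whose centroids are close, weighting each merge by cluster size. The greedy single-pass clustering and the radius thresholds below are illustrative choices, not the text's algorithm.

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def map_task(chunk, radius=1.0):
    """Greedily cluster one chunk; emit (centroid, size) pairs."""
    summaries = []                                   # list of (centroid, size)
    for p in chunk:
        for i, (c, n) in enumerate(summaries):
            if dist(p, c) < radius:                  # fold p into an existing cluster
                summaries[i] = (tuple((ci * n + pi) / (n + 1)
                                      for ci, pi in zip(c, p)), n + 1)
                break
        else:
            summaries.append((p, 1))                 # p starts a new cluster
    return summaries

def reduce_task(all_summaries, radius=1.0):
    """Merge clusters from different Map tasks when their centroids are close."""
    merged = []
    for c, n in all_summaries:
        for i, (mc, mn) in enumerate(merged):
            if dist(c, mc) < radius:
                total = mn + n                       # size-weighted centroid of the merge
                merged[i] = (tuple((a * mn + b * n) / total
                                   for a, b in zip(mc, c)), total)
                break
        else:
            merged.append((c, n))
    return merged

chunks = [[(0.0, 0.0), (0.1, 0.2)], [(0.05, 0.1), (5.0, 5.0)]]
print(reduce_task([s for ch in chunks for s in map_task(ch)]))
```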

15: In the stream-computing model, what is the sliding window used for?

A. To partition clusters

B. To filter out noisy data

C. To estimate the sum of distances

D. To maintain the most recent points

16: Which of the following is NOT a parameter commonly included in the representation of a cluster in the GRGPF Algorithm?

A. Number of points in the cluster

B. Clustroid of the cluster

C. Cluster's centroid in a Euclidean space

D. Distance between cluster points and the clustroid


17: In the BDMO Algorithm's merging step, which of the following is NOT a consideration when merging two consecutive buckets?

A. Bucket size is twice the sizes of the two buckets being merged

B. The timestamp of the merged bucket

C. Clusters from the same bucket should not be merged

D. Merging should occur if the sum of rowsums is above a certain limit

18: In a parallel environment using the map-reduce strategy for clustering, what is the primary
responsibility of the Map tasks?

A. Merge clusters produced by other Map tasks

B. Compute centroids and sizes of clusters

C. Ensure an equal distribution of points to Reduce tasks

D. Communicate with other Map tasks to coordinate merging

19: Which of the following strategies is typically used to decide which clusters to merge when
answering queries in a stream-clustering algorithm?

A. Merge clusters with the smallest sum of distances to the centroid

B. Merge clusters with the largest sum of distances to the centroid

C. Merge clusters with the same clustroid

D. Merge clusters with the largest number of points

20: In the context of the GRGPF Algorithm, what is the role of the k points that are furthest from the
clustroid of a cluster?

A. They serve as the initial centroids of the cluster.

B. They are used to estimate the sum of distances in the cluster.

C. They help determine whether two clusters are close enough to merge.

D. They are considered for merging with other clusters.


21: In the stream-computing model, when answering a query for the clusters of the most recent m points, what is the significance of the last 2m points in the selected buckets?

A. They are the points most distant from the centroid.

B. They represent the oldest data in the stream.

C. They are used for clustering initialization.

D. They ensure that the query result covers exactly the last m points.

22: In the BDMO Algorithm for stream clustering, what is the purpose of bucket sizes being nondecreasing as we go back in time?

A. It ensures that buckets can be merged efficiently.

B. It guarantees that each bucket covers the same time period.

C. It allows for simpler timestamp management.

D. It minimizes the number of buckets needed.

23: In the parallel clustering approach discussed for a computing cluster, what is the main role of the Map tasks?

A. To compute the centroids of clusters.

B. To merge clusters produced by the Reduce task.

C. To distribute clusters to different nodes.

D. To create multiple Reduce tasks.

24: When choosing which clusters to merge in a computing cluster's Reduce task, what is one of the strategies discussed?

A. Merging clusters with the largest diameters first.

B. Merging clusters based on their order of appearance.

C. Merging clusters with centroids closest to the origin.

D. Merging clusters with the same number of points.
