21CS63 - Unit1 Practice Questions

Data Mining (21CS63)
UNIT 1: PRACTICE QUESTIONS

Course Coordinator: Dr. Vani V (Sec B)
1. What is the primary reason for the emergence of data mining in the information age?
2. Define data mining and its significance in modern information technology.
3. What are data objects, and how do they differ from attribute types?
4. Describe the basic statistical measures used to describe data.
5. Explain the concept of data similarity and dissimilarity.
6. List the major tasks involved in data pre-processing.
7. What are the methods used for data reduction in data pre-processing?
8. Describe the strategies involved in data transformation.
9. How does data mining contribute to the evolution of information technology?
10. Compare and contrast data objects and attribute types, providing examples for each.
11. Explain the significance of measuring data similarity and dissimilarity in data mining.
12. Discuss the importance of data pre-processing in the data mining process.
13. Differentiate between data cleaning, integration, and reduction, providing examples of
each.
14. How do wavelet transforms and principal component analysis contribute to data
reduction?
15. Describe the process of data transformation and its role in data mining.
16. Compare and contrast normalization and discretization by binning as data
transformation strategies.
17. What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation or application of technology developed from
databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of
database technology. Do you think that data mining is also the result of the evolution
of machine learning research? Can you present such views based on the historical
progress of this discipline? Address the same for the fields of statistics and pattern
recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
18. Given the following dataset representing the scores of 10 students in a class: {75, 80,
85, 90, 65, 70, 78, 82, 88, 92}, calculate the mean, median, and standard deviation of
the scores.
19. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile–quantile plot different from a quantile plot?
20. In a survey of 50 individuals, the following distinct age values were recorded: {25, 30,
35, 40, 45, 50, 55, 60, 65, 70}. Calculate the range of ages observed in the survey.
21. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
22. It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending
on the similarity measures used. Nonetheless, seemingly different similarity measures
may be equivalent after some transformation. Suppose we have the following two-
dimensional data set:
      A1    A2
x1    1.5   1.7
x2    2.0   1.9
x3    1.6   1.8
x4    1.2   1.5
x5    1.5   1.0
(a) Consider the data as two-dimensional data points. Given a new data point, x = (1.4,
1.6) as a query, rank the database points based on similarity with the query using
Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
23. A dataset contains the following values representing monthly sales for a product:
{1000, 1500, 1200, 1800, 900, 1300}. Calculate the variance and standard deviation
of the sales data.
24. Consider two sets of data: Set A {3, 5, 7, 9, 11} and Set B {4, 6, 8, 10, 12}. Treating
each set as a five-dimensional vector, calculate the Euclidean distance between them.
25. Calculate the Manhattan distance between the points (2, 3) and (5, 7).
26. Using the Jaccard similarity coefficient, compute the similarity between the sets
{apple, orange, banana} and {orange, banana, grape}.
27. Explain the process of data normalization using the min-max normalization method
with the range from 0 to 1, and apply it to a dataset with values ranging from 10 to
100.
28. Given a dataset of ages: {18, 20, 22, 25, 30, 35, 40, 45, 50}, perform equal-width
binning with bin size = 10. Determine the bins and assign each age to its
corresponding bin.
29. Consider the following data (in increasing order) for the attribute age: 13, 15,
16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70.
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
30. Use the normalization methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
31. Using the data for age given in question 29, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving
reasons as to why.
32. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
33. Use a flowchart to summarize the following procedures for attribute subset selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
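
Several of the numerical questions above (18 to 32) rely on the same handful of formulas. The following is a minimal Python sketch, not an official solution key, that could be used to check hand calculations. All function names are hypothetical and chosen for illustration, the sample values in the usage section are made up rather than taken from the questions, and the population form of the standard deviation is assumed (use statistics.stdev if the sample form is expected).

```python
import statistics


def minkowski(x, y, q):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def supremum(x, y):
    """Supremum (Chebyshev) distance: largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def jaccard(s, t):
    """Jaccard coefficient: |intersection| / |union| of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Z-score normalization given a mean and a standard deviation."""
    return (v - mean) / std


def equal_width_bins(values, k):
    """Split values into k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bin k, so clamp it into the last bin.
        # If all values are equal (width 0), everything goes into bin 0.
        idx = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def smooth_by_bin_means(sorted_values, depth):
    """Equal-depth binning, then replace each value by its bin mean."""
    smoothed = []
    for i in range(0, len(sorted_values), depth):
        chunk = sorted_values[i:i + depth]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed


if __name__ == "__main__":
    # Illustrative values only (not the data from the questions).
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(statistics.mean(data), statistics.median(data))
    print(statistics.pstdev(data))                 # population standard deviation
    print(minkowski((0, 0), (3, 4), 2))            # Euclidean
    print(minkowski((0, 0), (3, 4), 1))            # Manhattan
    print(supremum((0, 0), (3, 4)))
    print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
    print(min_max(55, 10, 100))
    print(equal_width_bins([1, 2, 5, 9, 10], 3))
    print(smooth_by_bin_means([4, 8, 15, 21, 21, 24], 3))
```

To check an answer, substitute the tuples or value lists from the relevant question for the illustrative data. Note that equal_width_bins takes a number of bins, so a question that fixes the bin width instead (as question 28 does) needs the bin count derived from the data range first.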
