Data Mining Answer Key
Data cleaning is needed before data analysis because it makes the data
accurate, complete, and consistent. Inaccurate or incomplete data can lead
to wrong or misleading results.
Here are some of the reasons why data cleaning is needed before data analysis:
• To remove errors: Data cleaning can help to identify and remove errors
from the data. This includes errors such as typos, missing values, and
inconsistent formatting.
• To make the data complete: Data cleaning can help to identify and fill in
missing values in the data. This is important because missing values can
skew the results of the analysis.
• To make the data consistent: Data cleaning can help to ensure that the data
is consistent in terms of its format, units, and values. This is important
because inconsistent records are hard to compare and combine during analysis.
In short, data cleaning is an important step in the data analysis process. It helps to
ensure that the data is accurate, complete, and consistent, which is essential for
getting accurate and reliable results.
Common data cleaning tasks include:
• Identifying and removing errors: This includes typos, missing values, and
inconsistent formatting.
• Filling in missing values: This can be done using a variety of methods, such
as imputation (for example, using the mean or median) or interpolation.
• Categorizing data: This can help to make the data more manageable and
easier to analyze.
• Cleaning up text data: This can involve removing noise, correcting spelling
errors, and normalizing text.
Data cleaning can be a complex and time-consuming process, but it is essential for
getting accurate and reliable results from data analysis.
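The cleaning tasks listed above can be illustrated with a minimal sketch in plain Python. The records, the field names (city, temp_c), and the choice of mean imputation are assumptions made for this example only; they are not part of the original answer.

```python
from statistics import mean

# Hypothetical raw records with inconsistent text formatting and a missing value.
raw = [
    {"city": " Pune ", "temp_c": "31.5"},
    {"city": "pune", "temp_c": None},
    {"city": "MUMBAI", "temp_c": "29.0"},
    {"city": "Mumbai", "temp_c": "30.0"},
]

def clean(records):
    # Normalize text (strip whitespace, consistent case) and convert numbers.
    rows = [
        {
            "city": r["city"].strip().title(),
            "temp_c": float(r["temp_c"]) if r["temp_c"] is not None else None,
        }
        for r in records
    ]
    # Fill missing values with the mean of the observed values (one simple
    # imputation choice among many).
    observed = [r["temp_c"] for r in rows if r["temp_c"] is not None]
    fill = mean(observed)
    for r in rows:
        if r["temp_c"] is None:
            r["temp_c"] = fill
    return rows

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

After cleaning, the four city labels collapse to two consistent values, and the missing temperature is replaced by the mean of the three observed readings.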
There are two main types of hierarchical clustering: agglomerative and divisive.
Agglomerative hierarchical clustering starts with each data point as its own cluster
and repeatedly merges the two closest clusters until only one cluster is left.
Divisive hierarchical clustering starts with all the data points in one cluster and
repeatedly splits clusters until each data point is in its own cluster.
The dendrogram is a tree-like diagram that shows the hierarchy of clusters created
by hierarchical clustering. The dendrogram shows how the clusters were merged
together, and it can be used to visualize the relationships between the different
clusters.
      0
    /   \
   1     2
  / \   / \
 3   4 5   6
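In this dendrogram, node 0 is the root cluster that contains all of the customer
data. It splits into clusters 1 and 2, which split further into clusters 3 and 4
and clusters 5 and 6, giving six clusters below the root. Clusters that join lower
in the tree, at a smaller merge distance, are more similar to each other than
clusters that join only near the root.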
Hierarchical clustering is a powerful tool for grouping data points together based
on their similarity. It is a versatile algorithm that can be used to cluster data from a
variety of domains.
Consider the dataset given below. Cluster it using hierarchical clustering
and plot the dendrogram for it.
Item   A   B   C   D   E
A      0
B      7   0
C      2   5   0
D      6   4   8   0
E     10   8   3   7   0
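The question does not say which linkage to use. As one possible worked solution, the sketch below applies agglomerative clustering with single linkage (an assumption), where the distance between two clusters is the smallest distance between any pair of their items. The function name single_linkage is illustrative only.

```python
from itertools import combinations

# Distance matrix from the question, stored as its lower triangle.
labels = ["A", "B", "C", "D", "E"]
lower = [
    [0],
    [7, 0],
    [2, 5, 0],
    [6, 4, 8, 0],
    [10, 8, 3, 7, 0],
]
# Look up distances by unordered pair of items.
dist = {
    frozenset((labels[i], labels[j])): lower[i][j]
    for i in range(len(labels))
    for j in range(i)
}

def single_linkage(labels, dist):
    """Return the merge history as (left items, right items, merge distance)."""
    def d(c1, c2):
        # Single linkage: smallest pairwise distance between the two clusters.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [frozenset([x]) for x in labels]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: d(*p))
        merges.append((sorted(c1), sorted(c2), d(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

merges = single_linkage(labels, dist)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

Under the single-linkage assumption, A and C merge first at distance 2, E joins them at 3, B and D merge at 4, and the two remaining groups merge at 5. These four merge heights give the levels of the dendrogram. A different linkage, such as complete or average linkage, can produce a different tree.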
[5860]-212