Chapter 4 Descriptive Data Mining
DESCRIPTIVE DATA MINING

INTRODUCTION

The increase in the use of data-mining techniques in business has been caused largely by three events:

• The explosion in the amount of data being produced and electronically tracked
• The ability to electronically warehouse these data
• The affordability of computer power to analyze the data

Observation: a set of recorded values of variables associated with a single entity

Data-mining approaches can be separated into two categories:

• Supervised learning: for prediction and classification
• Unsupervised learning: to detect patterns and relationships in the data
  • Can be thought of as high-dimensional descriptive analytics
  • Designed to describe patterns and relationships in large data sets with many observations of many variables

DATA PREPARATION

• Treatment of Missing Data
• Identification of Outliers and Erroneous Data

• The data in a data set are often said to be “dirty” and “raw” before they have been preprocessed
• We need to put them into a form that is best suited for a data-mining algorithm
• Data preparation makes heavy use of descriptive statistics and data visualization methods

TREATMENT OF MISSING DATA

• The primary options for addressing missing data are:
  • To discard observations with any missing values
  • To discard any variable with missing values
  • To fill in missing entries with estimated values
  • To apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
• Dealing with missing data requires understanding why the data are missing and the impact of the missing data
• If a missing value is a random occurrence, it is called a data value missing completely at random (MCAR)
• If the missing values are not completely random (i.e., they are correlated with the values of some other variables), they are called missing at random (MAR)
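The discard and fill-in options listed under Treatment of Missing Data can be sketched with pandas; the DataFrame below and the choice of the column median as the estimated value are illustrative assumptions, not part of the chapter's data:

```python
import pandas as pd

# Hypothetical customer data with one missing Income value
df = pd.DataFrame({
    "Age": [61, 38, 45, 29],
    "Income": [57881.0, None, 61000.0, 32500.0],
    "Children": [2, 1, 0, 3],
})

# Option 1: discard observations (rows) with any missing values
dropped_rows = df.dropna(axis=0)

# Option 2: discard any variable (column) with missing values
dropped_cols = df.dropna(axis=1)

# Option 3: fill in the missing entry with an estimated value
# (here, the median of the observed Income values)
filled = df.fillna({"Income": df["Income"].median()})

print(len(dropped_rows), dropped_cols.columns.tolist(), filled["Income"].tolist())
```

The fourth option, an algorithm that tolerates missing values directly, needs no preprocessing step and so is not shown here.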
• Data is missing not at random (MNAR) if the reason the value is missing is related to the value of the missing variable itself

IDENTIFICATION OF OUTLIERS AND ERRONEOUS DATA

• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data-quality issues and outliers
• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis
• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets
• If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers

VARIABLE REPRESENTATION

• In many data-mining applications, the number of variables for which data is recorded may be prohibitively large to analyze
• Dimension reduction: the process of removing variables from the analysis without losing any crucial information
• One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information
• Such variables can be aggregated or removed to allow more parsimonious model development
• A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider
• The treatment of categorical variables is particularly important
• Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships

CLUSTER ANALYSIS

• Measuring Similarity Between Observations
• Hierarchical Clustering
• k-Means Clustering
• Hierarchical Clustering Versus k-Means Clustering

• The goal of clustering is to segment observations into similar groups based on observed variables
• Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration
• Commonly used in marketing to divide customers into different homogeneous groups, a practice known as market segmentation
• Also used to identify outliers
• Clustering methods:
  • Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters
  • k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible
• Both methods depend on how similar two observations are; hence, we have to measure similarity between observations

MEASURING SIMILARITY BETWEEN OBSERVATIONS

• When observations include numeric (continuous) variables, Euclidean distance is the most common method to measure dissimilarity between observations
• KTC wants groups of customers such that customers within a group are similar, and customers in different groups are dissimilar, with respect to key characteristics
• For each customer, KTC has a corresponding vector of measurements on seven customer variables, that is, (Age, Female, Income, Married, Children, Car Loan, Mortgage)

Example: The observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds to a 61-year-old male with an annual income of $57,881, married with two children, but no car loan and no mortgage

Figure 4.1: Euclidean Distance

Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values

Figure 4.2: Measuring Similarity Between Clusters
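The Euclidean distance between observations u = (u1, ..., uq) and v = (v1, ..., vq) is the square root of the sum of squared coordinate differences. A minimal computation using the observation u from the text and a second, hypothetical customer v:

```python
import math

# Observation u from the text: a 61-year-old male, income $57,881,
# married, two children, no car loan, no mortgage
# (Age, Female, Income, Married, Children, Car Loan, Mortgage)
u = (61, 0, 57881, 1, 2, 0, 0)
# A second, made-up customer purely for illustration
v = (30, 1, 33000, 0, 1, 1, 0)

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean(u, v)
print(round(d, 2))
```

Note that Income dominates this distance because of its units, which is one reason variables are commonly standardized before measuring similarity.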
HIERARCHICAL CLUSTERING
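As described under Cluster Analysis above, bottom-up hierarchical clustering starts with each observation in its own cluster and sequentially merges the most similar clusters into a series of nested clusters. A minimal sketch using scipy; the six observations and the choice of Ward linkage are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up 2-D observations forming two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Bottom-up merging: each row of Z records one merge of the two
# most similar clusters (Ward linkage here)
Z = linkage(X, method="ward")

# Cut the tree of nested clusters into two groups
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the same tree at a different t yields a different number of clusters without re-running the merges, which is what makes the nested series useful.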
K-MEANS CLUSTERING
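k-means, as summarized above, assigns each observation to one of k clusters so that observations in the same cluster are as similar as possible. A self-contained numpy sketch of the underlying iteration (Lloyd's algorithm); the data are made up, and the deterministic initialization from the first k observations is a simplification of the random initialization used in practice:

```python
import numpy as np

def k_means(X, k, n_iter=100):
    """Sketch of Lloyd's algorithm: alternate nearest-centroid
    assignment and centroid recomputation until centroids stop moving."""
    # Deterministic init with the first k observations (a simplification;
    # random initialization is more common in practice)
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned observations
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Six made-up 2-D observations, alternating between two separated groups
X = np.array([[1.0, 1.0], [9.0, 9.0], [1.1, 0.9],
              [8.9, 9.2], [0.8, 1.2], [9.1, 8.8]])
labels, centroids = k_means(X, k=2)
print(labels)  # prints [0 1 0 1 0 1]
```

Because each step only reduces within-cluster dissimilarity, the loop converges, though the result can depend on the initial centroids.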