Chapter 4 Descriptive Data Mining

INTRODUCTION

• The increase in the use of data-mining techniques in business has been caused largely by three events:
    • The explosion in the amount of data being produced and electronically tracked
    • The ability to electronically warehouse these data
    • The affordability of computer power to analyze the data
• Observation: a set of recorded values of variables associated with a single entity
• Data-mining approaches can be separated into two categories:
    • Supervised learning—for prediction and classification
    • Unsupervised learning—to detect patterns and relationships in the data
        • Can be thought of as high-dimensional descriptive analytics
        • Designed to describe patterns and relationships in large data sets with many observations of many variables

DATA PREPARATION

• Treatment of Missing Data
• Identification of Outliers and Erroneous Data
• Variable Representation

• The data in a data set are often said to be "dirty" and "raw" before they have been preprocessed
• We need to put them into a form that is best suited for a data-mining algorithm
• Data preparation makes heavy use of descriptive statistics and data-visualization methods

TREATMENT OF MISSING DATA

• The primary options for addressing missing data are (a sketch of the first three appears after this section):
    • To discard observations with any missing values
    • To discard any variable with missing values
    • To fill in missing entries with estimated values
    • To apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
• Dealing with missing data requires an understanding of why the data are missing and of the impact of the missing data
• If a missing value is a random occurrence, it is called a data value missing completely at random (MCAR)
• If the missing values are not completely random (i.e., correlated with the values of some other variables), they are called missing at random (MAR)
• Data are missing not at random (MNAR) if the reason that the value is missing is related to the value of the variable
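A minimal pandas sketch of the first three options, using a small hypothetical data frame (the column names and values are illustrative, not taken from the chapter's data sets):

    import numpy as np
    import pandas as pd

    # Hypothetical customer records; np.nan marks a missing entry
    df = pd.DataFrame({
        "Age":     [61, 35, np.nan, 48],
        "Income":  [57881, np.nan, 41000, 52500],
        "Married": [1, 0, 1, np.nan],
    })

    drop_obs  = df.dropna()              # discard observations (rows) with any missing values
    drop_vars = df.dropna(axis=1)        # discard any variable (column) with missing values
    imputed   = df.fillna(df.median())   # fill in missing entries with an estimate (column medians here)
    # The fourth option is to leave the gaps in place and use an algorithm,
    # such as classification and regression trees, that tolerates missing values.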
IDENTIFICATION OF OUTLIERS AND ERRONEOUS DATA

• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data-quality issues and outliers
• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis
• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets
• If a model's implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers
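A rough sketch of this kind of screening in pandas, using a made-up Income series with one suspicious value; summary statistics plus a simple interquartile-range fence flag the candidate outlier for closer examination:

    import pandas as pd

    income = pd.Series([57881, 41000, 52500, 48800, 460000], name="Income")

    print(income.describe())              # count, mean, std, min, quartiles, max

    q1, q3 = income.quantile(0.25), income.quantile(0.75)
    iqr = q3 - q1
    outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
    print(outliers)                       # flags the 460000 entry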
VARIABLE REPRESENTATION

• In many data-mining applications, the number of variables for which data are recorded may be prohibitive to analyze
• Dimension reduction: the process of removing variables from the analysis without losing any crucial information
• One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information (see the sketch below)
• Such variables can be aggregated or removed to allow more parsimonious model development
• A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider
• The treatment of categorical variables is particularly important
• Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships
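A small pandas sketch of the correlation check, assuming a hypothetical data frame in which MonthlyIncome is essentially Income / 12 and therefore redundant:

    import pandas as pd

    # Hypothetical candidate variables (not the chapter's data)
    df = pd.DataFrame({
        "Income":        [57881, 41000, 52500, 61000, 32000],
        "Mortgage":      [0, 1, 1, 0, 1],
        "MonthlyIncome": [4823, 3417, 4375, 5083, 2667],
    })

    corr = df.corr().abs()                # absolute pairwise correlations
    print(corr.round(2))

    # Pairs correlated above a chosen threshold are candidates for aggregation or removal
    redundant = [(a, b) for a in corr.columns for b in corr.columns
                 if a < b and corr.loc[a, b] > 0.95]
    print(redundant)                      # [('Income', 'MonthlyIncome')]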
CLUSTER ANALYSIS

• Measuring Similarity Between Observations
• Hierarchical Clustering
• k-Means Clustering
• Hierarchical Clustering Versus k-Means Clustering

• The goal of clustering is to segment observations into similar groups based on observed variables
• Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration
• Commonly used in marketing to divide customers into different homogeneous groups; this is known as market segmentation
• Also used to identify outliers
• Clustering methods:
    • Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters
    • k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible
• Both methods depend on how similar two observations are—hence, we have to measure similarity between observations
MEASURING SIMILARITY BETWEEN OBSERVATIONS

• Euclidean distance: the most common method to measure dissimilarity between observations when the observations include numeric (continuous) variables
• Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq) each comprise measurements of q variables
• The Euclidean distance between observations u and v is:

    d(u, v) = √[(u1 − v1)² + (u2 − v2)² + · · · + (uq − vq)²]

ILLUSTRATION:

• KTC is a financial advising company that provides personalized financial advice to its clients
• KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar, and customers in different groups are dissimilar, with respect to key characteristics
• For each customer, KTC has an observation corresponding to a vector of measurements on seven customer variables, that is, (Age, Female, Income, Married, Children, Car Loan, Mortgage)
• Example: the observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds to a 61-year-old male with an annual income of $57,881, married with two children, but no car loan and no mortgage

Figure 4.1: Euclidean Distance

• Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values
• Euclidean distance is highly influenced by the scale on which variables are measured
• It is therefore common to standardize the units of each variable j of each observation u
• Example: uj, the value of variable j in observation u, is replaced with its z-score, zj
• The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations
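A minimal NumPy sketch of the distance calculation; u is the example observation above, while v and w are made-up customers for illustration. Standardizing to z-scores first keeps the Income scale from dominating the distance:

    import numpy as np

    # Observations on (Age, Female, Income, Married, Children, Car Loan, Mortgage)
    u = np.array([61, 0, 57881, 1, 2, 0, 0], dtype=float)
    v = np.array([35, 1, 41000, 0, 1, 1, 1], dtype=float)   # hypothetical
    w = np.array([48, 0, 52500, 1, 0, 0, 0], dtype=float)   # hypothetical

    d_raw = np.sqrt(np.sum((u - v) ** 2))     # raw-unit distance, dominated by Income

    X = np.vstack([u, v, w])                  # in practice, the full data matrix
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # replace each value with its z-score, column by column
    d_std = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))

    print(round(d_raw, 1), round(d_std, 2))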
• When clustering observations solely on the basis of categorical variables encoded as 0–1, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values
• The simplest overlap measure is called the matching coefficient and is computed by:

    Matching coefficient = (number of variables with matching values for observations u and v) / (total number of variables)

• A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations
• To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard's coefficient does not count matching zero entries and is computed by:

    Jaccard's coefficient = (number of variables with matching nonzero values for observations u and v) / (total number of variables − number of variables with matching zero values for observations u and v)

Table 4.1: Comparison of Similarity Matrixes for Observations with Binary Variables
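A short Python sketch of both coefficients for 0–1 vectors; the two example customers and the variable subset are hypothetical:

    def matching_coefficient(u, v):
        """Fraction of variables with the same value in u and v."""
        matches = sum(a == b for a, b in zip(u, v))
        return matches / len(u)

    def jaccard_coefficient(u, v):
        """Like the matching coefficient, but matching 0-0 entries are ignored."""
        nonzero_matches = sum(a == b == 1 for a, b in zip(u, v))
        zero_matches = sum(a == b == 0 for a, b in zip(u, v))
        return nonzero_matches / (len(u) - zero_matches)

    # Two hypothetical customers measured on (Female, Married, Car Loan, Mortgage)
    u = [0, 1, 0, 0]
    v = [0, 1, 1, 0]
    print(matching_coefficient(u, v))   # 3/4 = 0.75 (two of the matches are shared zeros)
    print(jaccard_coefficient(u, v))    # 1/2 = 0.5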

HIERARCHICAL CLUSTERING

• Determines the similarity of two clusters by considering the similarity between the observations composing either cluster
• Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster
• Given a way to measure similarity between observations, there are several clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure:
    • Single linkage
    • Complete linkage
    • Group average linkage
    • Median linkage
• Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity
• Ward's method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible
• When McQuitty's method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2
• A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation

Figure 4.2: Measuring Similarity Between Clusters
Figure 4.3: Dendrogram for KTC
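A minimal SciPy sketch of this bottom-up (agglomerative) procedure on a small made-up, standardized data matrix (not the KTC data); the method argument selects the linkage:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from scipy.spatial.distance import pdist
    from scipy.stats import zscore

    # Hypothetical observations (rows) on three numeric variables, standardized to z-scores
    X = zscore(np.array([[61, 57881, 2],
                         [35, 41000, 1],
                         [48, 52500, 0],
                         [27, 32000, 0],
                         [55, 61000, 3]], dtype=float), axis=0)

    d = pdist(X, metric="euclidean")        # pairwise distances between observations
    Z = linkage(d, method="ward")           # merge order; "single", "complete", "average",
                                            # "centroid", and "median" are other linkage choices
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the nested clusters into 2 groups
    print(labels)
    # dendrogram(Z) draws the nested-cluster chart (needs matplotlib for display)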

K-MEANS CLUSTERING

• Given a value of k, the k-means algorithm randomly partitions the observations into k clusters
• After all observations have been assigned to a cluster, the resulting cluster centroids are calculated
• Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid

Figure 4.4: Clustering Observations by Age and Income Using k-Means Clustering with k = 3
Table 4.2: Average Distances within Clusters
Table 4.3: Distances Between Cluster Centroids
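A compact NumPy sketch of the steps just described, run on made-up standardized (Age, Income) values; the centroid update and reassignment repeat until the assignments stop changing. In practice a library implementation such as scikit-learn's KMeans would normally be used:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical standardized observations on (Age, Income), one row per customer
    X = np.array([[1.2, 1.1], [1.0, 0.9], [-0.8, -1.0],
                  [-1.1, -0.7], [0.1, 0.2], [-0.2, 0.0]])
    k = 3

    labels = rng.integers(k, size=len(X))     # step 1: random partition into k clusters

    for _ in range(100):
        # step 2: compute each cluster's centroid (re-seed a centroid if a cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # step 3: reassign every observation to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    print(labels)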
HIERARCHICAL CLUSTERING VERSUS K-MEANS CLUSTERING

ASSOCIATION RULES

• Evaluating Association Rules

• Association rules: if-then statements which convey the likelihood of certain items being purchased together
• Antecedent: the collection of items (or item set) corresponding to the if portion of the rule
• Consequent: the item set corresponding to the then portion of the rule
• Association rules are an important tool in market basket analysis
• Support count of an item set: the number of transactions in the data that include that item set

Table 4.4: Shopping-Cart Transactions

• Confidence: helps identify reliable association rules
• Lift ratio: a measure to evaluate the efficiency of a rule
• For the data in Table 4.4, the rule "if {bread, jelly}, then {peanut butter}" has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25 (recomputed in the sketch below)

EVALUATING ASSOCIATION RULES

• An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets
• For example, Wal-Mart mined its transactional data to uncover strong evidence of the association rule, "If a customer purchases a Barbie doll, then a customer also purchases a candy bar"
• An association rule is useful if it is well supported and explains an important previously unknown relationship

Table 4.5: Association Rules for Hy-Vee
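A small Python sketch that recomputes the confidence and lift of the Table 4.4 rule cited above. The transaction list is made up to reproduce the same counts (4 carts containing {bread, jelly}, 2 of them also containing peanut butter, and 4 of 10 carts containing peanut butter), since the actual table is not reproduced here:

    # Hypothetical shopping-cart transactions standing in for Table 4.4 (10 carts)
    transactions = [
        {"bread", "jelly", "peanut butter"}, {"bread", "jelly"},
        {"bread", "jelly", "milk", "peanut butter"}, {"bread", "jelly", "milk"},
        {"peanut butter", "milk"}, {"peanut butter"},
        {"bread"}, {"milk"}, {"jelly", "milk"}, {"bread", "milk"},
    ]

    antecedent = {"bread", "jelly"}
    consequent = {"peanut butter"}

    n = len(transactions)
    support_antecedent = sum(antecedent <= t for t in transactions)                 # 4
    support_both       = sum((antecedent | consequent) <= t for t in transactions)  # 2
    support_consequent = sum(consequent <= t for t in transactions)                 # 4

    confidence = support_both / support_antecedent        # 2/4 = 0.5
    lift = confidence / (support_consequent / n)          # 0.5 / (4/10) = 1.25
    print(confidence, lift)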
