Module 1
Objective
Explain the history and purpose of data mining across multiple disciplines.
Why Mine Data? Commercial Viewpoint
• Volume of data being collected
  - Web data, e-commerce
  - Purchases at stores
  - Bank/credit card transactions
• Computers are cheaper and more powerful
• Competitive pressure
What is Data Mining?
Many definitions:
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
[Venn diagram: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems]
Data Mining Tasks
• Predictive
  - Classification
  - Regression
  - Deviation Detection
• Descriptive
  - Clustering
  - Association Rule Discovery
Core Concepts of Data Mining
Classification
Objective
Describe different data mining tasks.
Classification
[Figure: classification example with a training set and a test data set]
Machine Learning Problems
• Supervised Learning
• Unsupervised Learning
[Figure: classification workflow — a Training Set is used to Learn a Classifier Model, which is then applied to a Test Set]
Classification Application: Direct Marketing
• Goal
  - Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
• Approach
Classification Application: Fraud Detection
• Goal
  - Predict fraudulent cases in credit card transactions.
• Approach
[Example: a learned function f maps an input image to the output label "Cow"]
Core Concepts of Data Mining
Additional Data Mining Tasks
Objective
Describe different data mining tasks.
The Machine Learning Framework
[Figure: training examples from class 1 and class 2 separated by a linear decision boundary]
f(x) = sgn(w ⋅ x + b)
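The framework above can be sketched in Python (illustrative only; the weights below are hand-picked, not learned from data):

```python
import numpy as np

def predict(w, b, x):
    """Linear classifier from the framework: f(x) = sgn(w . x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hand-picked 2-D hyperplane (hypothetical values, not from the slides).
w = np.array([1.0, -1.0])
b = 0.0
predict(w, b, np.array([3.0, 1.0]))   # points below the line y = x get class +1
predict(w, b, np.array([1.0, 3.0]))   # points above it get class -1
```

A learning algorithm (perceptron, SVM, etc.) would choose w and b from the training examples instead of hard-coding them.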
Many Classifiers to Choose From
• SVM
• Naive Bayes
• Neural Networks
• Boosted Decision Trees
• K-nearest Neighbor
• RBMs
Clustering Definition
• Goal (market segmentation)
  - Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Approach
• Goal (document clustering)
  - Find groups of documents that are similar to each other based on the important terms appearing in them.
• Approach
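To make the clustering idea concrete, here is a minimal k-means sketch in Python (not from the slides; k-means is one of many clustering algorithms, and the 2-D points are synthetic):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated groups of points (synthetic data).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(X, k=2)   # first three points share a label, last three share the other
```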
Clustering Points: Gain
[Table: 3,204 articles of the Los Angeles Times, clustered by category — total articles vs. articles correctly placed per category]
Association Rule Discovery Application
• Goal
  - A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products.
  - It wants to keep its service vehicles equipped with the right parts to reduce the number of visits to consumer households.
• Approach
  - Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Regression
• Predict the value of a given continuous-valued variable based on the values of other variables.
  - Greatly studied in statistics and in the neural network field.
• Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
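The first example can be sketched with a least-squares line fit in Python (the advertising and sales numbers are made up for illustration):

```python
import numpy as np

# Hypothetical data: sales amounts vs. advertising expenditure.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0])
sales    = np.array([2.1, 3.9, 6.1, 8.0])

# Fit sales ~ a * ad_spend + b by least squares.
a, b = np.polyfit(ad_spend, sales, deg=1)

# Predict the continuous target at a new budget of 5.0.
predicted = a * 5.0 + b
```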
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  - Credit card fraud detection
  - Network intrusion detection
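A simple z-score rule illustrates the idea (a sketch with hypothetical transaction amounts; real fraud detection uses far richer models):

```python
import numpy as np

# Hypothetical transaction amounts; the last one deviates strongly.
amounts = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 250.0])

# Flag values more than 2 standard deviations from the mean.
z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 2]   # -> the 250.0 transaction
```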
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Ownership and Distribution
Data Quality
Privacy Preservation
Streaming Data
Data Attributes Needed
for Data Mining
Introduction to Data Attributes
Objective
Recognize attributes of data needed for data mining.
What is Data?
Attribute types:
• Nominal
• Ordinal
• Interval
• Ratio
Properties of Attribute Values

Attribute Type   Properties                        Operations
Nominal          Distinctness                      =  ≠
Ordinal          Distinctness & Order              <  >
Interval         Distinctness, Order & Addition    +  -
Ratio            All four (adds Multiplication)    *  /
Attribute Type / Description / Examples / Operations
• Nominal — Different names. Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
• Ordinal — Provide enough information to order objects. Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
• Interval — Differences between values are meaningful. Examples: calendar dates, temperature (°C, °F). Operations: mean, standard deviation, Pearson's correlation, t and F tests.
• Ratio — Both differences and ratios are meaningful. Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
Attribute Level / Transformation / Comments
• Nominal — Any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)
• Symmetric
  - A binary variable contains two possible outcomes: 1 (positive/present) and 0 (negative/absent).
  - If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric.
  - Both outcomes are equally valuable and important.
• Asymmetric
  - The outcomes of the binary variable are not equally important.
  - Only presence (a non-zero attribute value) is regarded as important.
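The symmetric/asymmetric distinction matters when computing similarity. A common textbook pairing is the Simple Matching Coefficient (treats 0-0 and 1-1 matches alike, suited to symmetric attributes) versus the Jaccard coefficient (ignores 0-0 matches, suited to asymmetric attributes); a sketch:

```python
def smc(x, y):
    """Simple Matching Coefficient: fraction of positions that agree (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over all positions that are not 0-0."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    mismatch = sum(a != b for a, b in zip(x, y))
    return m11 / (m11 + mismatch) if (m11 + mismatch) else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
# smc(x, y) = 7/10 (the many shared zeros count), jaccard(x, y) = 0
```

For sparse asymmetric data (e.g. items purchased), Jaccard avoids two objects looking similar merely because both lack most items.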
Types of data sets
• Record
  - Data Matrix
  - Document Data
  - Transaction Data
• Spatial data
Types of Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
[Table: example records with attributes Refund, Marital Status, Taxable Income, and class Cheat]
Document Data
• Each document is represented as a vector of term counts:

             Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document 1     3     0     5     0     2      6     0    2      0        2
Document 2     0     7     0     2     1      0     0    3      0        0
Document 3     0     1     0     0     1      2     2    0      3        0
Transaction Data
[Table: example transactions, each record listing the set of items involved]
Chemical Data
• Benzene molecule: C6H6
Ordered Data
• Sequences of transactions
Data Quality: Noise and Outliers
• Noise refers to the modification of original values
• Outliers
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature Creation
Attribute Transformation
Aggregation
Simple Sampling
• Random sampling without replacement
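Simple random sampling without replacement, sketched with Python's standard library (the record list is a stand-in for a real data set):

```python
import random

random.seed(0)                         # deterministic for illustration
records = list(range(1000))            # stand-in for a large data set
sample = random.sample(records, k=50)  # simple random sample, no replacement
# Every record appears at most once in the sample.
```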
Curse of Dimensionality
• Difficult to visualize past 3 dimensions
• Difficult to store
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
• Purpose
  - Avoid the curse of dimensionality
  - Reduce the time and memory required by algorithms
  - Allow easier visualization
  - Eliminate irrelevant features or reduce noise
• Techniques
  - Principal Component Analysis
  - Singular Value Decomposition
  - Others (supervised and non-linear techniques)
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in the data
[Figure: 2-D data in the (x1, x2) plane with the principal eigenvector e along the direction of greatest variance]
Dimensionality Reduction: PCA
[Diagram: Original Data Matrix × Projection Matrix → Reduced Data Matrix; Reduced Data Matrix × Projection Matrixᵀ → Reconstructed Data Matrix]
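The projection/reconstruction pipeline above can be sketched via the SVD on synthetic data (a minimal sketch, not a production PCA; the synthetic third attribute is deliberately near-redundant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)   # third attribute nearly redundant

Xc = X - X.mean(axis=0)                  # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Vt[:2].T                          # projection matrix: top-2 principal directions
reduced = Xc @ proj                      # reduced data matrix (100 x 2)
reconstructed = reduced @ proj.T         # reconstructed data matrix (100 x 3)
error = np.abs(reconstructed - Xc).max() # small: the data is nearly rank 2
```

Because almost all the variation lies in two directions, dropping the third component loses very little information.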
Feature Subset Selection
Brute-force Approach
• Try all possible feature subsets as input to the data mining algorithm
• Infeasible: 2^n possibilities for n attributes
Embedded Approaches
• Feature selection occurs naturally as part of the data mining algorithm
Filter Approaches
• Features are selected before the data mining algorithm is run
Wrapper Approaches
• Use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets
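A wrapper approach can be sketched as greedy forward selection, treating the scoring function as a black box (the feature names and toy score below are hypothetical):

```python
def forward_select(features, score, max_features=2):
    """Greedy wrapper: repeatedly add the feature that most improves `score`."""
    selected = []
    while len(selected) < max_features:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                       # no candidate improves the score
        selected.append(best)
    return selected

# Toy black-box score: only features "a" and "c" contribute (hypothetical).
useful = {"a": 0.4, "b": 0.0, "c": 0.3, "d": 0.0}
score = lambda subset: sum(useful[f] for f in subset)
chosen = forward_select(["a", "b", "c", "d"], score)   # -> ["a", "c"]
```

Unlike brute force, this evaluates O(n · max_features) subsets instead of 2^n, at the cost of possibly missing the globally best subset.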
Feature Creation Methodologies
• Fourier transform
• Wavelet transform
• Similarity
  - Numerical measure of how alike two data objects are
  - Higher when objects are more alike
  - Often falls in the range [0, 1]
• Dissimilarity
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
[Table: similarity and dissimilarity formulas for nominal, ordinal, and interval or ratio attributes]
Data Attributes Needed
for Data Mining
Distance Measures
Objective
Recognize attributes of data needed for data mining.
Euclidean Distance
• Standardization is necessary if scales differ
Euclidean Distance: Example

Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
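The distance matrix above can be reproduced in Python:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4

# Pairwise differences, then Euclidean norm along the coordinate axis.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))   # 4x4 symmetric distance matrix

# dist[0, 1] = sqrt((0-2)^2 + (2-0)^2) = sqrt(8) ~ 2.828
```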
Minkowski Distance
• A generalization of the Euclidean distance: dist = (Σ_k |x_k − y_k|^r)^(1/r)
Minkowski Distance: Examples

Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (r = 1):
       p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2 (r = 2):
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞ (supremum distance, r → ∞):
       p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
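The three cases can be computed with one helper (a sketch; `minkowski` and `chebyshev` are names chosen here for illustration, not a library API):

```python
import numpy as np

def minkowski(a, b, r):
    """Minkowski distance: (sum_k |a_k - b_k|^r)^(1/r)."""
    return (np.abs(np.asarray(a) - np.asarray(b)) ** r).sum() ** (1 / r)

def chebyshev(a, b):
    """Limit r -> infinity: the L-infinity (supremum) distance."""
    return np.abs(np.asarray(a) - np.asarray(b)).max()

p1, p2 = (0, 2), (2, 0)
minkowski(p1, p2, 1)   # L1: |0-2| + |2-0| = 4
minkowski(p1, p2, 2)   # L2: sqrt(8) ~ 2.828
chebyshev(p1, p2)      # L-infinity: max(2, 2) = 2
```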
Common Properties of a Distance
• d(x, y) ≥ 0, and d(x, y) = 0 only if x = y (positive definiteness)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Common Properties of a Similarity
Example (document vectors):
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 ∙ d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
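The dot product above feeds directly into the cosine similarity (dot product over the product of vector lengths); a sketch:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

dot = d1 @ d2                                          # = 5, as computed above
cos = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))  # cosine similarity ~ 0.315
```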
• Correlation is always in the range −1 to 1
• Density-based clustering requires a notion of density
• Examples:
  - Euclidean density
  - Probability density
  - Graph-based density
Euclidean Density – Cell-based