
Core Concepts of Data Mining

Why Mine Data?


Objective
Explain the history and purpose of data mining across multiple disciplines.
Why Mine Data? Commercial Viewpoint

|Volume of Data Being Collected
- Web data, e-commerce
- Purchases at stores
- Bank/credit card transactions

|Computers are cheaper and more powerful

|Competitive Pressure
- Provide better, customized services for an edge

(Slide examples) Video produces about 25-30 frames per second; over 20 petabytes of data are processed per day (June 2018); 300 hours of video are uploaded every minute (June 2018).
Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour):
- Remote sensors on satellites
- Telescopes scanning the sky
- Microarrays generating gene expression data
- Scientific simulations generating terabytes of data
Why Mine Data? Scientific Viewpoint

|Traditional techniques infeasible for raw data

|Data mining may help scientists
- In classifying and segmenting data
- In hypothesis formation
Why Mine Data? Medical Viewpoint

|Data size for just 1 single sensor over 80 years: 1.35 TB per person

|Where can interesting data be stored?

|How do we search data for interesting features?

|Which features are important for future knowledge discovery?
Core Concepts of Data Mining
Introduction to Data Mining
Objective
Explain the history and purpose of data mining across multiple disciplines.
What is Data Mining?

Many Definitions:
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

(Diagram by Kenneth Jensen [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons)
What is (not) Data Mining?

|What is NOT Data Mining?
- Looking up a phone number in a phone directory
- Querying a search engine for information about a topic

|What IS Data Mining?
- Which names are more prevalent in certain US locations?
- Which documents returned by a search engine can be grouped together by similarity according to their context?
Useful Terms: Feature Extraction
Origins of Data Mining

Data mining draws on ideas from three fields:
- Statistics/AI
- Machine Learning/Pattern Recognition
- Database Systems
Data Mining Tasks

|Predictive Methods
- Use some variables to predict unknown or future values of other variables.

|Descriptive Methods
- Find human-interpretable patterns that describe the data.
Data Mining Tasks

Predictive:
- Classification
- Regression
- Deviation Detection

Data Mining Tasks

Descriptive:
- Clustering
- Association Rule Discovery
Core Concepts of Data Mining
Classification
Objective
Describe different data mining tasks.
Classification

(Slide diagram) Elements: training data set, test data set, distance metric, find closest data points, configure a machine, classes.
Machine Learning Problems

|Supervised Learning
- Discrete: classification (categorization)
- Continuous: regression

|Unsupervised Learning
- Discrete: clustering
- Continuous: dimensionality reduction
Definition: Classification

|Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.

|Find a model for the class attribute as a function of the values of the other attributes.

|Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model.
Classification Example
Training Set (attribute types: Categorical, Categorical, Continuous, Class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Classification Example

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Classification

Training Set → Learn Classifier → Model; the learned model is then applied to the Test Set.
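To make the workflow concrete, here is a minimal sketch of learning a classifier on the training set above and applying it to the test records. It assumes Python with scikit-learn and a simple integer encoding of the categorical attributes; the slides do not prescribe a library or an encoding.

from sklearn.tree import DecisionTreeClassifier

# Training set: (refund, marital_status, taxable_income) -> cheat
# Integer coding for illustration: refund Yes=1/No=0;
# marital Single=0, Married=1, Divorced=2
X_train = [
    [1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
    [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90],
]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Apply the learned model to previously unseen (test) records
X_test = [[0, 0, 75], [1, 1, 50], [0, 1, 150]]
print(model.predict(X_test))

A decision tree is only one choice here; any of the classifiers listed later in this section could be swapped in.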
Classification Application: Direct Marketing

|Goal:
- Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

|Approach:
- Use the data for a similar product introduced before.
- Determine which customers buy and which don't; this is the class attribute.
- Collect various demographic, lifestyle, and customer-interaction data.
- Use this information as input attributes to learn a classifier model.
Classification Application: Fraud Detection

|Goal:
- Predict fraudulent cases in credit card transactions.

|Approach:
- Use credit card transactions and the information on the account-holder as attributes.
- Label past transactions as fraud or fair transactions; this is the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing credit card transactions.
Classification Application: Image Recognition

Image Recognition: classify an image into one of the pre-defined categories.

|Object Recognition

|Face Recognition

Classifying Galaxies
- Data Size: 72 million stars, 20 million galaxies; Object Catalog: 9 GB; Image Database: 150 GB
- Class: stages of formation
- Attributes: image features, characteristics of light waves received, etc.
The Classification Framework

|Apply a prediction function f to a feature representation of the image to get the desired output, e.g. f(image) = "Apple", f(image) = "Tomato", f(image) = "Cow".
Core Concepts of Data Mining
Additional Data Mining Tasks
Objective
Describe different data mining tasks.
The Machine Learning Framework

y = f(x)

|Training: given a training set of labeled examples {(x1, y1), ..., (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set.

|Testing: apply f to a never before seen test example x and output the predicted value y = f(x).
Steps
Training: training images + training labels → image features → training → learned model.

Testing: test image → image features → learned model → prediction.

Features

|Raw Pixels

|Histograms

|GIST Descriptors
Classifiers: Nearest Neighbor

Given training examples from class 1 and class 2:

f(x) = label of the training example nearest to x
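A minimal sketch of this rule in Python/NumPy; the two-class training points below are made up for illustration:

import numpy as np

def nearest_neighbor_classify(x, X_train, y_train):
    """1-NN: f(x) = label of the training example nearest to x."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    return y_train[np.argmin(distances)]

# Hypothetical 2-D training examples from two classes
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([1, 1, 2, 2])
print(nearest_neighbor_classify(np.array([0.5, 0.2]), X_train, y_train))  # -> 1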


Classifiers: Linear

Find a linear function to separate the classes:

f(x) = sgn(w ⋅ x + b)
Many Classifiers to Choose From

- SVM
- Neural Networks
- Naive Bayes
- Bayesian Network
- Logistic Regression
- Randomized Forests
- Boosted Decision Trees
- K-nearest Neighbor
- RBMs
Clustering Definition

|Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.

|Similarity Measures:
- Euclidean distance if attributes are continuous.
- Other problem-specific measures.
Illustrating Clustering

|Euclidean distance based clustering in 3-D space: intracluster distances are minimized, while intercluster distances are maximized.
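As an illustration, the sketch below clusters hypothetical 3-D points with k-means, which explicitly minimizes intracluster (within-cluster) squared distances. It assumes scikit-learn, which the slides do not prescribe.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 3-D points forming two well-separated blobs
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (20, 3)), rng.normal(5, 0.5, (20, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # centroids of the two clusters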
Clustering Application: Market Segmentation

|Goal:
- Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

|Approach:
- Collect different attributes of customers based on their geographical and lifestyle related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering Application: Document Clustering

|Goal:
- To find groups of documents that are similar to each other based on the important terms appearing in them.

|Approach:
- Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

|Gain:
- Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Illustrating Document Clustering

|Clustering Points: 3204 articles of the Los Angeles Times

|Similarity Measure: how many words are common in these documents (after some word filtering)

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
Association Rule Discovery: Definition

|Given a set of records, each of which contains some number of items from a given collection,
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diapers, Milk
4    Beer, Bread, Diapers, Milk
5    Coke, Diapers, Milk

Rules Discovered:
{Milk} → {Coke}
{Diapers, Milk} → {Beer}
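A minimal sketch of evaluating a candidate rule on the transaction table above, using the usual support and confidence measures (the slide itself does not define them):

# Transactions from the table above
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, both / ante

print(rule_stats({"Diapers", "Milk"}, {"Beer"}, transactions))  # (0.4, ~0.67)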
Association Rule Discovery: Marketing and
Sales Promotion
Let the rule discovered be {Bagels, ...} → {Potato Chips}

|Potato Chips as consequent
- Can be used to determine what should be done to boost its sales.

|Bagels in the antecedent
- Can be used to see which products would be affected if the store discontinues selling bagels.

|Bagels in the antecedent and Potato Chips in the consequent
- Can be used to see what products should be sold with Bagels to promote the sale of Potato Chips.
Association Rule Discovery: Supermarket
Shelf Management

|Goal
- To identify items that are bought together by sufficiently many customers.

|Approach
- Process the point-of-sale data collected with barcode scanners to find dependencies among items.

|A classic rule
- If a customer buys diapers and milk, then he or she is very likely to buy beer.
- So, don't be surprised if you find six-packs stacked next to diapers!
Association Rule Discovery: Inventory
Management

|Goal
- A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products.
- It wants to keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.

|Approach
- Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Regression

|Predict
- A value of a given continuous valued variable based on the values of other variables.
- Greatly studied in statistics and neural network fields.

|Examples:
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
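A minimal sketch of the first example, fitting sales against advertising expenditure by least squares; the data values are made up for illustration:

import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # advertising expenditure
sales    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # observed sales amounts

slope, intercept = np.polyfit(ad_spend, sales, 1)  # degree-1 (linear) fit
print(f"predicted sales at spend 6.0: {slope * 6.0 + intercept:.2f}")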
Deviation/Anomaly Detection

|Detect significant
deviations from normal
behavior

|Applications:
-Credit Card Fraud
Detection
-Network Intrusion
Detection
Challenges of Data Mining

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Ownership and Distribution
Data Quality
Privacy Preservation
Streaming Data
Data Attributes Needed
for Data Mining
Introduction to Data Attributes
Objective
Recognize attributes of data needed for data mining.
What is Data?

|Definition
- Collection of data objects and their attributes

|Attribute
- Property or characteristic of an object

|Object
- Described by a collection of attributes
- Also known as record, point, case, sample, entity, or instance

Example (attribute types: Categorical, Categorical, Continuous, Class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values

|Attribute values are numbers or symbols assigned to an attribute

|Distinction between attributes and attribute values
- The same attribute can be mapped to different attribute values

|Different attributes can be mapped to the same set of values
- But the properties of the attribute values can be different
Types of Attributes

Nominal

Ordinal

Interval

Ratio
Properties of Attribute Values

Properties of attribute values and the operators that test them:

Property        Operators
Distinctness    =, ≠
Order           <, >
Addition        +, −
Multiplication  *, /

Attribute Type  Properties
Nominal         Distinctness
Ordinal         Distinctness & Order
Interval        Distinctness, Order & Addition
Ratio           All four properties
Attribute Type / Description / Examples / Operations:

Nominal: different names. Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency, correlation, χ² test.

Ordinal: provides enough information to order objects. Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Interval: differences between values are meaningful. Examples: calendar dates, temperature (C, F). Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio: both differences and ratios are meaningful. Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
Attribute Level / Allowed Transformation / Comments:

Nominal: any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)

Ordinal: an order preserving change of values. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Interval: new_value = a * old_value + b, where a and b are constants. (The Fahrenheit and Celsius temperature scales differ in terms of where the zero value is and the size of a unit, i.e. a degree.)

Ratio: new_value = a * old_value. (Length can be measured in meters or feet.)
Discrete and Continuous Attributes

|Discrete Attribute
- Has only a finite or countably infinite set of values
- Often represented as integer variables
- Note: binary attributes are a special case of discrete attributes

|Continuous Attribute
- Has real numbers as attribute values
- Practically, real values can only be measured and represented using a finite number of digits
- Continuous attributes are typically represented as floating-point variables
Symmetric and Asymmetric Attributes

|Symmetric
- A binary variable contains two possible outcomes: 1 (positive/present) and 0 (negative/absent)
- If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric
- Both outcomes are equally valuable and important

|Asymmetric
- If the outcomes of a binary variable are not equally important, only presence (a non-zero attribute value) is regarded as important
Types of data sets

Record
• Data Matrix
• Document Data
• Transaction Data

Graph
• World Wide Web
• Molecular Structures

Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Important Characteristics of Structured
Data

|Dimensionality
- Curse of dimensionality
- Text data has dimensionality of the order of tens of thousands

|Sparsity
- Only presence counts
- A document-term matrix is a sparse matrix

|Resolution
- Patterns depend on the scale
Data Attributes Needed
for Data Mining
Types of Data
Objective
Recognize attributes of data needed for data mining.
Types of Data

Record data consists of a collection of records, each of which consists of a fixed set of attributes, e.g. (Categorical, Categorical, Continuous, Class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix

|If data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.

|Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
Data Matrix

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data

Each document becomes a term vector. Example document-term matrix (the term names read vertically in the original slide):

            team  coach  player  ball  score  game  win  lost  timeout  season
Document 1  3     0      5       0     2      6     0    2     0        2
Document 2  0     7      0       2     1      0     0    3     0        0
Document 3  0     1      0       0     1      2     2    0     3        0
Transaction Data

|A special type of record data, where each record (transaction) involves a set of items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diapers, Milk
4    Beer, Bread, Diapers, Milk
5    Coke, Diapers, Milk
Graph Data

|Generic graph and HTML links (the slide shows a small example graph with numbered nodes)
Chemical Data

|Benzene Molecule:
C6H6
Ordered Data

|Sequences of
transactions
Ordered Data

|Spatio-temporal data (the slide shows the average monthly temperature of land and ocean for January)
Data Quality

|What kinds of data quality problems are there?

|How can we detect problems with the data?

|What can we do about these problems?

|Examples of data quality problems:
- Noise and outliers
- Missing values
- Duplicate data
Noise

|Noise refers to modification of original values
Outliers

|Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
Missing Values

|Reasons for missing values
- Information is not collected
- Attributes may not be applicable to all cases

|Handling missing values
- Eliminate data objects
- Estimate missing values
- Ignore the missing value during analysis
Duplicate Data

|Data set may include data objects that are duplicates, or almost duplicates, of one another
- Major issue when merging data from heterogeneous sources

|Example:
- Same person with multiple email addresses

|Data cleaning
- The process of dealing with duplicate data issues
Data Attributes Needed
for Data Mining
Data Preprocessing
Objective
Recognize attributes of data needed for data mining.
Data Preprocessing

Aggregation

Sampling

Dimensionality Reduction

Feature subset selection

Feature creation

Discretization and Binarization

Attribute Transformation
Aggregation

|Combining two or more attributes (or objects) into a single attribute (or object)

|Purpose
- Data reduction
- Change of scale
- More "stable" data
Variation of Precipitation in Australia

(Slide figures: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation)
Sampling

|Sampling is the main technique employed for data selection
- Often used for preliminary investigation

|Sampling occurs because it is too expensive to obtain the entire dataset

|Data mining uses sampling because processing the entire database is too expensive
Key Principles for Effective Sampling

|Using a sample will work almost as well as using the entire data set, if the sample is representative

|A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling

- Simple Random Sampling
- Sampling Without Replacement
- Sampling With Replacement
- Stratified Sampling
Sample Size

(Slide figures: the same data set sampled at 8000, 2000, and 500 points)

What sample size is necessary to have at least one object from each of 10 groups?
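One way to answer this is by simulation. The sketch below assumes the 10 groups are equally likely and draws with replacement until every group has been seen (the classic coupon-collector setting); on average about 29 draws are needed.

import random

def sample_size_for_all_groups(n_groups=10, trials=10_000):
    """Average number of random draws needed to see every group once."""
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n_groups:
            seen.add(random.randrange(n_groups))
            draws += 1
        total += draws
    return total / trials

print(sample_size_for_all_groups())  # ~29.3 on average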
Curse of Dimensionality

High-dimensional data is problematic: it may have only a small intrinsic dimension, it is computationally expensive to process, difficult to visualize past 3 dimensions, and difficult to store.
Curse of Dimensionality

|2-D space: grades in math and physics. Each grade ranges from 1 (min) to 5 (max).

|Total number of possible combinations = 5 * 5 = 25

|With 100 students, each combination appears 4 times on average.

          Grade in Math  Grade in Physics
Student1  5              5
Student2  5              4
Student3  2              3
Student4  1              1
Student5  3              4
Student6  2              3
Student7  4              5
Curse of Dimensionality

Grade in each subject:

          Math  Physics  CS  Java  DB  Mobile Prog.
Student1  5     5        4   5     3   3
Student2  5     4        3   5     4   5
Student3  2     3        2   4     1   2
Student4  1     1        2   1     3   1
Student5  3     4        5   3     4   1
Student6  2     3        4   5     3   1
Student7  4     5        2   1     5   5
Curse of Dimensionality

Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
Dimensionality Reduction

|Purpose
- Avoid the curse of dimensionality
- Reduce time and memory required by algorithms
- Allow easier visualization
- Eliminate irrelevant features or reduce noise

|Techniques
- Principal Component Analysis
- Singular Value Decomposition
- Others (supervised and non-linear techniques)
Dimensionality Reduction: PCA

|Goal is to find a projection that captures the largest amount of variation in data (the slide shows 2-D data with the principal eigenvector e in the x1-x2 plane).

Dimensionality Reduction: PCA

|Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.

Dimensionality Reduction: PCA

Original Data Matrix × Projection Matrix → Reduced Data Matrix; Reduced Data Matrix × Projection Matrix → Reconstructed Data Matrix.
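A minimal NumPy sketch of the eigenvector route just described: center the data, take the covariance matrix's top eigenvector as the projection, and reconstruct. The data values are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, np.argmax(eigvals)]    # direction of largest variance

reduced = Xc @ top                      # project onto 1 dimension
reconstructed = np.outer(reduced, top) + X.mean(axis=0)
print(reduced.shape, reconstructed.shape)  # (100,) (100, 2)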
Feature Subset Selection

|Another way to reduce dimensionality of data

|Redundant features
- Duplicate much or all of the information contained in one or more other attributes

|Irrelevant features
- Contain no information that is useful for the data mining task at hand
Feature Subset Selection Techniques

Brute-force Approach
• Try all possible feature subsets as input to the data mining algorithm
• Infeasible: 2^n possibilities for n attributes

Embedded Approaches
• Feature selection occurs naturally as part of the data mining algorithm

Filter Approaches
• Features are selected before the data mining algorithm is run

Wrapper Approaches
• Use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets
Feature Creation Methodologies

|Feature Extraction
- Domain specific

|Mapping Data to a New Space

|Feature Construction
- Combining features
Mapping Data to a New Space

|Fourier transform
|Wavelet transform

(Slide figures: two sine waves; two sine waves + noise; the frequency domain after the transform)

Discretization Without Using Class Labels
Attribute Transformation

|A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
- Simple functions: x^k, log(x), e^x, |x|
- Standardization and normalization
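A short sketch of the transformations just listed, applied to a made-up attribute:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

standardized = (x - x.mean()) / x.std()           # zero mean, unit variance
normalized = (x - x.min()) / (x.max() - x.min())  # rescale to [0, 1]
log_scaled = np.log(x)                            # simple function log(x)
print(standardized.mean().round(6), normalized.min(), normalized.max())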
Similarity and Dissimilarity

|Similarity
- Numerical measure of how alike two data objects are
- Higher when objects are more alike
- Often falls in the range [0, 1]

|Dissimilarity
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0; the upper limit varies

|Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

For single attribute values p and q, the standard measures (filling in the slide's table with the usual definitions) are:

Nominal: d = 0 if p = q, 1 if p ≠ q; s = 1 − d.
Ordinal: d = |p − q| / (n − 1), with values mapped to integers 0 to n − 1; s = 1 − d.
Interval or Ratio: d = |p − q|; s = −d, or s = 1 / (1 + d).
Data Attributes Needed
for Data Mining
Distance Measures
Objective
Recognize attributes of data needed for data mining.
Euclidean Distance

dist(p, q) = sqrt( Σ_k (p_k − q_k)² ), where the sum runs over the n attributes and p_k and q_k are the k-th attributes of data objects p and q.

|Standardization is necessary if scales differ.
Euclidean Distance

Point coordinates:
Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
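The distance matrix above can be reproduced with a few lines of NumPy:

import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4

diff = points[:, None, :] - points[None, :, :]   # pairwise differences
dist = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distances
print(dist.round(3))  # e.g. d(p1, p2) = 2.828, d(p1, p4) = 5.099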
Minkowski Distance

|A generalization of Euclidean distance:

dist(p, q) = ( Σ_k |p_k − q_k|^r )^(1/r), where r is a parameter.
Minkowski Distance: Examples

|r = 1: City block (Manhattan, taxicab, L1 norm) distance

|r = 2: Euclidean distance

|r → ∞: "supremum" (Lmax norm, L∞ norm) distance
Minkowski Distance

Point coordinates:
Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1   p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0

L2   p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

L∞   p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
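A small sketch that computes all three cases from one Minkowski function (r = 1, r = 2, and the supremum limit):

import numpy as np

def minkowski(p, q, r):
    """Minkowski distance; r=1 Manhattan, r=2 Euclidean, r=inf supremum."""
    d = np.abs(np.asarray(p) - np.asarray(q))
    return d.max() if np.isinf(r) else (d ** r).sum() ** (1.0 / r)

p1, p2 = [0, 2], [2, 0]
print(minkowski(p1, p2, 1))       # 4        (L1)
print(minkowski(p1, p2, 2))       # 2.828... (L2)
print(minkowski(p1, p2, np.inf))  # 2        (L-infinity)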
Common Properties of a Distance

|d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)

|d(p, q) = d(q, p) for all p and q. (Symmetry)

|d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
Data Attributes Needed
for Data Mining
Similarity and Correlation measures
Objective
Recognize attributes of data needed for data mining.
Common Properties of a Similarity

|s(p, q) = 1 (or maximum similarity) only if p = q.

|s(p, q) = s(q, p) for all p and q. (Symmetry)


Similarity Between Binary Vectors

|A common situation is that objects p and q have only binary attributes.

|Compute similarities using the following quantities:
- M01 = the number of attributes where p was 0 and q was 1
- M10 = the number of attributes where p was 1 and q was 0
- M00 = the number of attributes where p was 0 and q was 0
- M11 = the number of attributes where p was 1 and q was 1
Similarity Between Binary Vectors

|Simple Matching and Jaccard Coefficients
- SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
- J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

Cosine Similarity

If d1 and d2 are two vectors, then cos(d1, d2) = (d1 ∙ d2) / (||d1|| ||d2||).

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 ∙ d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
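The same computation in NumPy:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # 0.315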


Extended Jaccard Coefficient (Tanimoto)

|A variation of Jaccard for continuous or count attributes
- Reduces to Jaccard for binary attributes
- The standard form is T(p, q) = (p ∙ q) / (||p||² + ||q||² − p ∙ q)
Correlation

|Correlation is always in the range −1 to 1.

|A correlation of 1 (−1) means that the vectors have a perfectly positive (negative) linear relationship, that is, p_k = a * q_k + b, where a and b are constants.

Example: x = [3, 3, 6], y = [2, 3, 4], n = 3 here

mean(x) = 4, std(x) = sqrt(3) = 1.732
mean(y) = 3, std(y) = sqrt(1) = 1

s(x, y) = [(−1)(−1) + (−1)(0) + (2)(1)] / 2 = 1.5

corr(x, y) = 1.5 / (1.732 * 1) = 0.866
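The example can be checked in NumPy (using sample statistics with n − 1, matching the computation above):

import numpy as np

x = np.array([3.0, 3.0, 6.0])
y = np.array([2.0, 3.0, 4.0])

cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)  # 1.5
corr = cov / (x.std(ddof=1) * y.std(ddof=1))                  # ~0.866
print(round(corr, 3), round(np.corrcoef(x, y)[0, 1], 3))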
Correlation

|Correlation captures only linear relationships.

Example: x = [−3, −2, −1, 0, 1, 2, 3], y = [9, 4, 1, 0, 1, 4, 9]

This is a perfect non-linear relationship, y_k = x_k², but corr(x, y) = 0.
Visually Evaluating Correlation
General Approach for Combining
Similarities
Using Weights to Combine Similarities

|We may not want to treat all attributes the same.
- Use weights w_k which are between 0 and 1 and sum to 1.
Density

|Density-based clustering requires a notion of density

|Examples:
- Euclidean density
- Probability density
- Graph-based density
Euclidean Density – Cell-based

|Cell-based density: divide the space into a grid and record the point counts for each grid cell.
Euclidean Density – Center-based

|Euclidean density is the number of points within a specified radius of the point
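A minimal sketch of center-based density in NumPy, counting points within a radius of a query point (the points are random for illustration):

import numpy as np

def euclidean_density(points, center, radius):
    """Number of points within `radius` of `center`."""
    dists = np.linalg.norm(points - center, axis=1)
    return int((dists <= radius).sum())

rng = np.random.default_rng(2)
pts = rng.uniform(0, 10, size=(200, 2))  # hypothetical 2-D points
print(euclidean_density(pts, np.array([5.0, 5.0]), 1.0))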
