Module 1
Objective
Explain the history and purpose of data mining across multiple disciplines.
Why Mine Data? Commercial Viewpoint
• Volume of data being collected
  - Web data, e-commerce
  - Purchases at stores
  - Bank/credit card transactions
• Computers are cheaper and more powerful
• Competitive pressure
What is Data Mining?
Many definitions:
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
[Venn diagram: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems]
Data Mining Tasks
• Predictive
  - Classification
  - Regression
  - Deviation Detection
• Descriptive
  - Clustering
  - Association Rule Discovery
Core Concepts of Data Mining
Classification
Objective
Describe different data mining tasks.
Classification
[Figure: classification example with a training set and a test data set]
Machine Learning Problems
• Supervised Learning
• Unsupervised Learning
[Figure: classification workflow — a Training Set is used to Learn a Classifier Model, which is then applied to a Test Set]
Classification Application: Direct Marketing
• Goal
  - Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
• Approach
Classification Application: Fraud Detection
• Goal
  - Predict fraudulent cases in credit card transactions.
• Approach
[Example: a learned function f maps an input image to the output label "Cow"]
Core Concepts of Data Mining
Additional Data Mining Tasks
Objective
Describe different data mining tasks.
The Machine Learning Framework
[Figure: training examples from class 1 and class 2 separated by a linear decision boundary]
f(x) = sgn(w ⋅ x + b)
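The framework above can be sketched in Python (illustrative only; the weights below are hand-picked, not learned from data):

```python
import numpy as np

def predict(w, b, x):
    """Linear classifier from the framework: f(x) = sgn(w . x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hand-picked 2-D hyperplane (hypothetical values, not from the slides).
w = np.array([1.0, -1.0])
b = 0.0
predict(w, b, np.array([3.0, 1.0]))   # points below the line y = x get class +1
predict(w, b, np.array([1.0, 3.0]))   # points above it get class -1
```

A learning algorithm (perceptron, SVM, etc.) would choose w and b from the training examples instead of hard-coding them.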
Many Classifiers to Choose From
• SVM
• Naive Bayes
• Neural Networks
• Boosted Decision Trees
• K-nearest Neighbor
• RBMs
Clustering Definition
• Goal (market segmentation)
  - Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Approach
• Goal (document clustering)
  - Find groups of documents that are similar to each other based on the important terms appearing in them.
• Approach
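To make the clustering idea concrete, here is a minimal k-means sketch in Python (not from the slides; k-means is one of many clustering algorithms, and the 2-D points are synthetic):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated groups of points (synthetic data).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(X, k=2)   # first three points share a label, last three share the other
```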
Clustering Points: Gain
[Table: 3,204 articles of the Los Angeles Times, clustered by category — total articles vs. articles correctly placed per category]
Association Rule Discovery Application
• Goal
  - A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products.
  - It wants to keep its service vehicles equipped with the right parts to reduce the number of visits to consumer households.
• Approach
  - Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Regression
• Predict the value of a given continuous-valued variable based on the values of other variables.
  - Greatly studied in statistics and in the neural network field.
• Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
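The first example can be sketched with a least-squares line fit in Python (the advertising and sales numbers are made up for illustration):

```python
import numpy as np

# Hypothetical data: sales amounts vs. advertising expenditure.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0])
sales    = np.array([2.1, 3.9, 6.1, 8.0])

# Fit sales ~ a * ad_spend + b by least squares.
a, b = np.polyfit(ad_spend, sales, deg=1)

# Predict the continuous target at a new budget of 5.0.
predicted = a * 5.0 + b
```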
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  - Credit card fraud detection
  - Network intrusion detection
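A simple z-score rule illustrates the idea (a sketch with hypothetical transaction amounts; real fraud detection uses far richer models):

```python
import numpy as np

# Hypothetical transaction amounts; the last one deviates strongly.
amounts = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 250.0])

# Flag values more than 2 standard deviations from the mean.
z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 2]   # -> the 250.0 transaction
```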
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Ownership and Distribution
Data Quality
Privacy Preservation
Streaming Data
Data Attributes Needed
for Data Mining
Introduction to Data Attributes
Objective
Recognize attributes of data needed for data mining.
What is Data?
Attribute types:
• Nominal
• Ordinal
• Interval
• Ratio
Properties of Attribute Values

Attribute Type   Properties                        Operations
Nominal          Distinctness                      =  ≠
Ordinal          Distinctness & Order              <  >
Interval         Distinctness, Order & Addition    +  -
Ratio            All four (adds Multiplication)    *  /
Attribute Type / Description / Examples / Operations
• Nominal — Different names. Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
• Ordinal — Provide enough information to order objects. Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
• Interval — Differences between values are meaningful. Examples: calendar dates, temperature (°C, °F). Operations: mean, standard deviation, Pearson's correlation, t and F tests.
• Ratio — Both differences and ratios are meaningful. Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
Attribute Level / Transformation / Comments
• Nominal — Any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)
• Symmetric
  - A binary variable contains two possible outcomes: 1 (positive/present) and 0 (negative/absent).
  - If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric.
  - Both outcomes are equally valuable and important.
• Asymmetric
  - The outcomes of the binary variable are not equally important.
  - Only presence (a non-zero attribute value) is regarded as important.
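The symmetric/asymmetric distinction matters when computing similarity. A common textbook pairing is the Simple Matching Coefficient (treats 0-0 and 1-1 matches alike, suited to symmetric attributes) versus the Jaccard coefficient (ignores 0-0 matches, suited to asymmetric attributes); a sketch:

```python
def smc(x, y):
    """Simple Matching Coefficient: fraction of positions that agree (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over all positions that are not 0-0."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    mismatch = sum(a != b for a, b in zip(x, y))
    return m11 / (m11 + mismatch) if (m11 + mismatch) else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
# smc(x, y) = 7/10 (the many shared zeros count), jaccard(x, y) = 0
```

For sparse asymmetric data (e.g. items purchased), Jaccard avoids two objects looking similar merely because both lack most items.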
Types of data sets
• Record
  - Data Matrix
  - Document Data
  - Transaction Data
• Spatial data
Types of Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
[Table: example records with attributes Refund, Marital Status, Taxable Income, and class Cheat]
Document Data
• Each document is represented as a vector of term counts:

             Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document 1     3     0     5     0     2      6     0    2      0        2
Document 2     0     7     0     2     1      0     0    3      0        0
Document 3     0     1     0     0     1      2     2    0      3        0
Transaction Data
[Table: example transactions, each record listing the set of items involved]
Chemical Data
• Benzene molecule: C6H6
Ordered Data
• Sequences of transactions
Data Quality: Noise and Outliers
• Noise refers to the modification of original values
• Outliers
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature Creation
Attribute Transformation
Aggregation
Simple Sampling
• Random sampling without replacement
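Simple random sampling without replacement, sketched with Python's standard library (the record list is a stand-in for a real data set):

```python
import random

random.seed(0)                         # deterministic for illustration
records = list(range(1000))            # stand-in for a large data set
sample = random.sample(records, k=50)  # simple random sample, no replacement
# Every record appears at most once in the sample.
```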
Curse of Dimensionality
• Difficult to visualize past 3 dimensions
• Difficult to store
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
• Purpose
  - Avoid the curse of dimensionality
  - Reduce the time and memory required by algorithms
  - Allow easier visualization
  - Eliminate irrelevant features or reduce noise
• Techniques
  - Principal Component Analysis
  - Singular Value Decomposition
  - Others (supervised and non-linear techniques)
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in the data
[Figure: 2-D data in the (x1, x2) plane with the principal eigenvector e along the direction of greatest variance]
Dimensionality Reduction: PCA
[Diagram: Original Data Matrix × Projection Matrix → Reduced Data Matrix; Reduced Data Matrix × Projection Matrixᵀ → Reconstructed Data Matrix]
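The projection/reconstruction pipeline above can be sketched via the SVD on synthetic data (a minimal sketch, not a production PCA; the synthetic third attribute is deliberately near-redundant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)   # third attribute nearly redundant

Xc = X - X.mean(axis=0)                  # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Vt[:2].T                          # projection matrix: top-2 principal directions
reduced = Xc @ proj                      # reduced data matrix (100 x 2)
reconstructed = reduced @ proj.T         # reconstructed data matrix (100 x 3)
error = np.abs(reconstructed - Xc).max() # small: the data is nearly rank 2
```

Because almost all the variation lies in two directions, dropping the third component loses very little information.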
Feature Subset Selection
Brute-force Approach
• Try all possible feature subsets as input to the data mining algorithm
• Infeasible: 2^n possibilities for n attributes
Embedded Approaches
• Feature selection occurs naturally as part of the data mining algorithm
Filter Approaches
• Features are selected before the data mining algorithm is run
Wrapper Approaches
• Use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets
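A wrapper approach can be sketched as greedy forward selection, treating the scoring function as a black box (the feature names and toy score below are hypothetical):

```python
def forward_select(features, score, max_features=2):
    """Greedy wrapper: repeatedly add the feature that most improves `score`."""
    selected = []
    while len(selected) < max_features:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                       # no candidate improves the score
        selected.append(best)
    return selected

# Toy black-box score: only features "a" and "c" contribute (hypothetical).
useful = {"a": 0.4, "b": 0.0, "c": 0.3, "d": 0.0}
score = lambda subset: sum(useful[f] for f in subset)
chosen = forward_select(["a", "b", "c", "d"], score)   # -> ["a", "c"]
```

Unlike brute force, this evaluates O(n · max_features) subsets instead of 2^n, at the cost of possibly missing the globally best subset.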
Feature Creation Methodologies
• Fourier transform
• Wavelet transform
• Similarity
  - Numerical measure of how alike two data objects are
  - Higher when objects are more alike
  - Often falls in the range [0, 1]
• Dissimilarity
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
[Table: similarity and dissimilarity formulas for nominal, ordinal, and interval or ratio attributes]
Data Attributes Needed
for Data Mining
Distance Measures
Objective
Recognize attributes of data needed for data mining.
Euclidean Distance
• Standardization is necessary if scales differ
Euclidean Distance: Example

Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
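The distance matrix above can be reproduced in Python:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4

# Pairwise differences, then Euclidean norm along the coordinate axis.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))   # 4x4 symmetric distance matrix

# dist[0, 1] = sqrt((0-2)^2 + (2-0)^2) = sqrt(8) ~ 2.828
```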
Minkowski Distance
• A generalization of the Euclidean distance: dist = (Σ_k |x_k − y_k|^r)^(1/r)
Minkowski Distance: Examples

Point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (r = 1):
       p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2 (r = 2):
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞ (supremum distance, r → ∞):
       p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
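The three cases can be computed with one helper (a sketch; `minkowski` and `chebyshev` are names chosen here for illustration, not a library API):

```python
import numpy as np

def minkowski(a, b, r):
    """Minkowski distance: (sum_k |a_k - b_k|^r)^(1/r)."""
    return (np.abs(np.asarray(a) - np.asarray(b)) ** r).sum() ** (1 / r)

def chebyshev(a, b):
    """Limit r -> infinity: the L-infinity (supremum) distance."""
    return np.abs(np.asarray(a) - np.asarray(b)).max()

p1, p2 = (0, 2), (2, 0)
minkowski(p1, p2, 1)   # L1: |0-2| + |2-0| = 4
minkowski(p1, p2, 2)   # L2: sqrt(8) ~ 2.828
chebyshev(p1, p2)      # L-infinity: max(2, 2) = 2
```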
Common Properties of a Distance
• d(x, y) ≥ 0, and d(x, y) = 0 only if x = y (positive definiteness)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Common Properties of a Similarity
Example (document vectors):
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 ∙ d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
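The dot product above feeds directly into the cosine similarity (dot product over the product of vector lengths); a sketch:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

dot = d1 @ d2                                          # = 5, as computed above
cos = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))  # cosine similarity ~ 0.315
```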
• Correlation is always in the range −1 to 1
• Density-based clustering requires a notion of density
• Examples:
  - Euclidean density
  - Probability density
  - Graph-based density
Euclidean Density – Cell-based