Source Materials: Data Mining: Concepts and Techniques, Machine Learning
UNIT –I Introduction
Data Mining, Data Mining Task Primitives,
Types of Data
Data Quality
Data objects and attributes.
Objects
An attribute is also known as a variable, field, characteristic, dimension, or feature
A collection of attributes describes an object
An object is also known as a record, point, case, sample, entity, or instance
Example records (rows 4-10):
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Attribute Values
Attribute values are numbers or symbols assigned to an
attribute for a particular object
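The same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of values, such as integers
But properties of attribute values can be different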
Attributes
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, professional rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple of another
in the same unit of measurement (10 K is twice as
high as 5 K).
e.g., temperature in Kelvin, years of experience (e.g., the
objects are employees), number of words (e.g., the objects
are documents), length, counts, monetary quantities
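The interval/ratio distinction can be checked numerically. The following is a minimal sketch in Python; the helper name `celsius_to_kelvin` and the sample readings are assumptions made for illustration. It shows that a difference between two temperatures is the same on both scales, while a ratio computed on Celsius readings does not survive conversion to Kelvin.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# Two illustrative readings in degrees Celsius.
warm_c, cool_c = 10.0, 5.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios only carry meaning on the Kelvin scale, which has a true zero.
ratio_c = warm_c / cool_c  # 2.0, yet 10 °C is not "twice as hot" as 5 °C
ratio_k = warm_k / cool_k  # close to 1.018

print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 3))
```

The Celsius ratio depends on where the scale places zero, which is why ratios are only treated as meaningful for ratio-scaled attributes.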
Properties of Attribute Values
The type of an attribute depends on which of the
following properties/operations it possesses:
Distinctness: = ≠
Order: < >
Differences are meaningful: + -
Ratios are meaningful: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & meaningful
differences
Ratio attribute: all 4 properties/operations
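The mapping from attribute type to supported properties can be written down as a small lookup table. This is a sketch only; the names `PROPERTIES`, `supports`, and `types_supporting` are hypothetical and chosen for this example.

```python
# Properties each attribute type possesses, following the list above.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "differences"},
    "ratio": {"distinctness", "order", "differences", "ratios"},
}

def supports(attribute_type: str, prop: str) -> bool:
    """Return True if the given attribute type possesses the property."""
    return prop in PROPERTIES[attribute_type]

def types_supporting(prop: str) -> list[str]:
    """List the attribute types for which the property is meaningful."""
    return [t for t, props in PROPERTIES.items() if prop in props]

print(supports("ordinal", "differences"))  # False
print(types_supporting("differences"))     # ['interval', 'ratio']
```

Each type inherits every property of the type before it, so the table also shows why an operation such as averaging (which needs differences) is not appropriate for nominal or ordinal attributes.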
Discrete and Continuous
Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a
collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Record Data
Data that consists of a collection of records,
each of which consists of a fixed set of
attributes
Document Data
Each document is a row of term counts:
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
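A term-count row like the ones above can be produced with a few lines of Python. This is a minimal sketch under simplifying assumptions: tokens are split on whitespace only, the vocabulary is a short subset of the terms shown, and the sample sentence and the function name `term_vector` are made up for illustration.

```python
from collections import Counter

def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "Team play wins: the team with the ball can score and score again"

print(term_vector(doc, vocabulary))  # [2, 0, 1, 1, 2]
```

Stacking one such vector per document gives a matrix in which each row is a document and each column is a term.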
Transaction Data
A special type of record data, where
Each record (transaction) involves a set of items.
For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the
items.
Transaction data can be represented as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
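One way to represent the transactions above as record data is a binary matrix with one column per item. The sketch below uses the five transactions from the table; the function name `to_records` and the 0/1 encoding are assumptions for illustration.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def to_records(transactions: dict[int, set[str]]):
    """Return the sorted item list and one 0/1 row per transaction."""
    items = sorted(set().union(*transactions.values()))
    rows = {
        tid: [int(item in basket) for item in items]
        for tid, basket in transactions.items()
    }
    return items, rows

items, rows = to_records(transactions)
print(items)  # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
for tid, row in rows.items():
    print(tid, row)
```

Every row now has the same fixed set of attributes, which is what makes it record data; each binary attribute is asymmetric, since the presence of an item matters more than its absence.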
Graph Data
Examples: Generic graph, a molecule, and
webpages
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-Temporal Data
Average monthly temperature of land and ocean
Example: A Web Mining Framework
Data Mining in Business Intelligence
Increasing potential to support business decisions, toward the end user:
Decision Making (End User)
Data Exploration: Statistical Summary, Querying, and Reporting
KDD Process: A Typical View from ML and Statistics
Example: Medical Data Mining
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous,
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics,
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering,
and other applications?
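Support and confidence for a rule such as Diaper → Beer can be computed directly from a list of transactions. The sketch below reuses the five grocery transactions from the Transaction Data table in these notes; the function names are hypothetical, and the resulting numbers describe this toy data only, not the [0.5%, 75%] figures quoted above.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, transactions) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions) -> float:
    """Support of the whole rule divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Diaper"}, {"Beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
# support=0.40, confidence=0.67
```

Diaper appears in three of the five transactions and together with Beer in two of them, giving support 2/5 and confidence 2/3. This brute-force count scans every transaction per itemset, so it illustrates the definitions rather than an efficient mining method for large datasets.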
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
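As a rough illustration of outliers falling out of regression analysis, the sketch below fits a least-squares line and flags points whose residual is much larger than the typical residual. The data, the function names, and the threshold (5 times the median absolute residual) are all assumptions chosen for this example, not a method prescribed by the source.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_outliers(xs, ys, k=5.0):
    """Indices of points whose absolute residual exceeds k times the median."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    scale = median(residuals)
    return [i for i, r in enumerate(residuals) if r > k * scale]

# Points on y = 2x + 1, except index 5, which is far off the line.
xs = list(range(10))
ys = [1, 3, 5, 7, 9, 40, 13, 15, 17, 19]

print(residual_outliers(xs, ys))  # [5]
```

Whether the flagged point is noise to discard or an exception worth investigating depends on the application, as in fraud detection.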
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression
Periodicity analysis
Similarity-based analysis
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle, for example,
terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining