Professional Documents
Culture Documents
01 Intro
01 Intro
Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 1 —
11
Data and Information Systems
(DAIS:) Course Structures at CS/UIUC
n Coverage:
Database,
data
mining,
text
information
systems,
Web
and
bioinformatics
n Data
mining
n Intro.
to
data
mining (CS412:
Han,
Chang
— Fall)
n Data
mining:
Principles
and
algorithms (CS512:
Han,
Chang,
Sundaram—Spring
and
Fall)
n Seminar:
Advanced
Topics
in
Data
mining
(CS591Han
— Fall
and
Spring.
1
credit
unit)
n Database
Systems:
n Introd.
to
database
systems
(CS411:
Chang,
Parameswaran,
Sinha
— Spring
and
Fall)
n Advanced
database
systems
(CS511:
Chang,
Parameswaran — Fall
or
Spring)
n Text
information
systems
n Text
information
system
(CS410
Zhai:
Spring)
n Advanced
text
information
systems
(CS598CXZ
(future
CS510)
Zhai:
Fall)
n Bioinformatics
(Saurabh Sinha)
n Yahoo!-‐DAIS
seminar
2
CS 412. Course Page & Class Schedule
n Class
Homepage:
https://wiki.cites.illinois.edu/wiki/display/cs412fa15
n Course
Website
n Course
Information
n Staff
n Newsgroup
(Piazza)
n Grading
n Schedule
n Lecture
Media
n Assignments
n Exams
n Extra
Projects
n Feedback
3
Teaching Staff: The Front-End
n Hobbies:
n Ocean
diving,
mountain
climbing.
n Brief
history
n Taiwan
(BS
in
EE
from
National
Taiwan
University)
n California
(MS
in
CS,
PhD
in
EE
from
Stanford)
n Illinois
(associate
professor
in
CS,
UIUC)
n “Data
mining”:
what
can
you
predict?
East
or
west?
CS
or
EE?
You Tell Me --
MetaExplorer
• source discovery FIND sources
• source modeling
• source indexing db of dbs
MetaIntegrator
• source selection
• schema integration QUERY sources
• query mediation
unified query interface
Project 2. WISDM: Data-aware Search over the Web
8
Demo: Entity Search–
“university of california #location”
Project BigSocial: Social Data Analytics
n Summary
12
Why Data Mining?
n The
Explosive
Growth
of
Data:
from
terabytes
to
petabytes
n Data
collection
and
data
availability
computerized
society
n Major
sources
of
abundant
data
simulation,
…
n Society
and
everyone:
news,
digital
c ameras,
YouTube
n Summary
14
What Is Data Mining?
Task-relevant Data
Data Cleaning
Data Integration
Databases
16
Example: A Web Mining Framework
n Web
mining
usually
involves
n Data
cleaning
n Data mining
knowledge-‐base
17
Data Mining in Business Intelligence
Increasing potential
to support
business decisions End User
Decision
Making
Data Exploration
Statistical Summary, Querying, and Reporting
19
Which View Do You Prefer?
n Which
view
do
you
prefer?
n KDD
vs.
ML/Stat.
vs.
Business
Intelligence
n Depending
on
the
data,
applications,
and
your
focus
n Data
Mining
vs.
Data
Exploration
n Business
intelligence
view
n Warehouse,
data
cube,
reporting
but
not
much
mining
n Business
objects
vs.
data
mining
tools
n Supply
chain
example:
mining
vs.
OLAP
vs.
presentation
tools
n Data
presentation
vs.
data
exploration
20
Chapter 1. Introduction
n Why
Data
Mining?
n Summary
21
Multi-Dimensional View of Data Mining
n Data
to
be
mined
n Database
data
(extended-‐relational,
object-‐oriented,
heterogeneous,
n Techniques
utilized
n Data-‐intensive,
data
warehouse
(OLAP),
machine
learning,
statistics,
n Summary
23
Data Mining: On What Kinds of Data?
n Summary
25
Data Mining Function: (1) Generalization
n Information
integration
and
data
warehouse
construction
n Data
cleaning,
transformation,
integration,
and
multidimensional
data
model
n Data
cube
technology
n Scalable
methods
for
computing
(i.e.,
materializing)
multidimensional
aggregates
n OLAP
(online
analytical
processing)
n Multidimensional
concept
description:
Characterization
and
discrimination
n Generalize,
summarize,
and
contrast
data
characteristics,
e.g.,
dry
vs.
wet
region
26
Data Mining Function: (2) Association and
Correlation Analysis
n Frequent
patterns
(or
frequent
itemsets)
n What
items
are
frequently
purchased
together
in
your
Walmart?
n Association,
correlation
vs.
causality
n A
typical
association
rule
n Diaper
à Beer
[0.5%,
75%]
(support,
confidence)
n Are
strongly
associated
items
also
strongly
correlated?
n How
to
mine
such
patterns
and
rules
efficiently
in
large
datasets?
n How
to
use
such
patterns
for
classification,
clustering,
and
other
applications?
27
Data Mining Function: (3) Classification
n Classification
and
label
prediction
n Construct
models
(functions)
based
on
some
training
examples
n Describe
and
distinguish
classes
or
concepts
for
future
prediction
n E.g.,
classify
countries
based
on
(climate),
or
classify
cars
based
on
(gas
mileage)
n Predict
some
unknown
class
labels
n Typical
methods
n Decision
trees,
naïve
Bayesian
classification,
support
vector
machines,
neural
networks,
rule-‐based
classification,
pattern-‐
based
classification,
logistic
regression,
…
n Typical
applications:
n Credit
card
fraud
detection,
direct
marketing,
classifying
stars,
diseases,
web-‐pages,
…
28
Data Mining Function: (4) Cluster Analysis
29
Data Mining Function: (5) Outlier Analysis
n Outlier
analysis
n Outlier:
A
data
object
that
does
not
comply
with
the
general
behavior
of
the
data
n Noise
or
exception?
―
One
person’s
garbage
could
be
another
person’s
treasure
n Methods:
by
product
of
clustering
or
regression
analysis,
…
n Useful
in
fraud
detection,
rare
events
analysis
30
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
n Sequence,
trend
and
evolution
analysis
n Trend,
time-‐series,
and
deviation
analysis:
e.g.,
regression
n e.g., first buy digital c amera, then buy large SD memory
cards
n Periodicity
analysis
n Similarity-‐based analysis
31
Structure and Network Analysis
n Graph
mining
n Finding
frequent
subgraphs
(e.g.,
chemical
compounds),
trees
(XML),
classmates,
…
n Links
carry
a
lot
of
semantic
information:
Link
mining
n Web
mining
n Web
is
a
big
information
network:
from
PageRank
to
Google
32
Evaluation of Knowledge
n Are
all
mined
knowledge
interesting?
n One
c an
mine
tremendous
amount
of
“patterns”
…)
n Some
may
not
be
representative,
may
be
transient,
…
n Coverage
n Accuracy
n Timeliness
n …
33
Chapter 1. Introduction
n Why
Data
Mining?
n Summary
34
Data Mining: Confluence of Multiple Disciplines
35
Why Confluence of Multiple Disciplines?
n Tremendous
amount
of
data
n Algorithms
must
be
scalable
to
handle
big
data
36
Chapter 1. Introduction
n Why
Data
Mining?
n Summary
37
Applications of Data Mining
n Web
page
analysis:
from
web
page
classification,
clustering
to
PageRank
&
HITS
algorithms
n Collaborative
analysis
&
recommender
systems
n Basket
data
analysis
to
targeted
marketing
n Biological
and
medical
data
analysis:
classification,
cluster
analysis
(microarray
data
analysis),
biological
sequence
analysis,
biological
network
analysis
n Data
mining
and
software
engineering
n From
major
dedicated
data
mining
systems/tools
(e.g.,
SAS,
MS
SQL-‐
Server
Analysis
Manager,
Oracle
Data
Mining
Tools)
to
invisible
data
mining
38
Summary
n Data
mining:
Discovering
interesting
patterns
and
knowledge
from
massive
amount
of
data
n A
natural
evolution
of
science
and
information
technology,
in
great
demand,
with
wide
applications
n A
KDD
process
includes
data
cleaning,
data
integration,
data
selection,
transformation,
data
mining,
pattern
evaluation,
and
knowledge
presentation
n Mining
can
be
performed
in
a
variety
of
data
n Data
mining
functionalities:
characterization,
discrimination,
association,
classification,
clustering,
trend
and
outlier
analysis,
etc.
n Data
mining
technologies
and
applications
n Major
issues
in
data
mining
39
Major Issues in Data Mining (1)
n Mining
Methodology
n Mining
various
and
new
kinds
of
knowledge
n Mining
knowledge
in
multi-‐dimensional
space
n Data
mining:
An
interdisciplinary
effort
n Boosting
the
power
of
discovery
in
a
networked
environment
n Handling
noise,
uncertainty,
and
incompleteness
of
data
n Pattern
evaluation
and
pattern-‐ or
constraint-‐guided
mining
n User
Interaction
n Interactive
mining
n Incorporation
of
background
knowledge
n Presentation
and
visualization
of
data
mining
results
40
Major Issues in Data Mining (2)
41
A Brief History of Data Mining Society
42
Conferences and Journals on Data Mining
43
Where to Find References? DBLP, CiteSeer, Google
45