Professional Documents
Culture Documents
Chap1 Intro
Chap1 Intro
by Michael Hahsler
Agenda
http://www.kdd.org/curriculum
Why Data Mining?
Commercial Viewpoint
▪ Businesses collect and warehouse lots
of data.
—Purchases at department/grocery stores
—Bank/credit card transactions
—Web and social media data
—Mobile and IOT
▪ Computers are cheaper and more
powerful.
▪ Competition to provide better services.
—Mass customization and recommendation
systems
—Targeted advertising
—Improved logistics
Why Mine Data?
Scientific Viewpoint
▪ Data collected and stored at
enormous speeds (GB/hour)
—remote sensors on a satellite
—telescopes scanning the skies
—microarrays generating gene
expression data
—scientific simulations
generating terabytes of data
▪ Data mining may help scientists
—identify patterns and relationships
—to classify and segment data
—formulate hypotheses
Knowledge Discovery in Databases (KDD) Process
Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. From data mining to knowledge discovery: an overview.
CRISP-DM Reference Model
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Tasks in the CRISP-DM Model
Agenda
Regression
Classification
Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Clustering
Group points such that
—Data points in one cluster are more similar to one another.
—Data points in separate clusters are less similar to one another.
Goal: Reduce the data size for predictive Approach: Group data given a subset of
models. the available information and then use
the group label instead of the original
data as input for predictive models.
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Association Rule Discovery
TID Items
1 Bread, Coke, Milk
2 Beer, Bread {Milk} → {Coke}
3 Beer, Coke, Diaper, Milk {Diaper, Milk} → {Beer}
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Management
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Regression
▪ Predict a value of a given
continuous valued variable
based on the values of other
variables, assuming a linear or
nonlinear model of
dependency.
▪ Studied in statistics and
econometrics.
Applications:
▪ Predicting sales amounts of new product based on advertising
expenditure.
▪ Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
▪ Time series prediction of stock market indices (autoregressive models).
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Classification
Find a model for the class attribute as a function of the values of other
attributes/features.
Training
Learn
Model
Set Classifier
Classification
Find a model for the class attribute as a function of the values of other
attributes/features.
Training
Learn
Model
Set Classifier
▪ Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new
product.
Classification: ▪ Approach:
— Use the data for a similar product introduced before or from a focus group.
Direct We have customer information (e.g., demographics, lifestyle, previous
purchases) and know which customers decided to buy and which decided
Marketing otherwise. This buy/don’t buy decision forms the class attribute.
— Use this information as input attributes to learn a classifier model.
— Apply the model to new customers to predict if they will buy the product.
▪ Goal: To predict whether a customer is likely to be lost to a
competitor.
▪ Approach:
Classification: —Use detailed record of transactions with each of the past and
present customers, to find attributes (frequency, recency,
Customer complaints, demographics, etc.).
Attrition/Churn —Label the customers as loyal or disloyal.
—Find a model for disloyalty.
—Rank each customer on a loyal/disloyal scale (e.g., churn
probability).
Classification: Sky
Survey Cataloging
▪ Goal: To predict class (star or
galaxy) of sky objects, especially
visually faint ones, based on the
telescopic survey images (from
Palomar Observatory).
▪ Approach:
—Segment the image to identify
objects.
—Derive features per object (40).
—Use known objects to model
the class based on these
features.
▪ Result: Found 16 new high
red-shift quasars.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Deviation/Anomaly Detection
▪ Applications:
—Credit Card Fraud Detection
—Network Intrusion
Detection
Text mining –
Data stream
document Graph mining –
mining/real time
clustering, topic social networks
data mining
models
Mining
spatiotemporal Distributed data
Visual data mining
data (e.g., moving mining
objects)
Challenges of Data Mining
Scalability
Data
ownership and Dimensionality
privacy
Complexity
and
Data quality
heterogeneous
data
Agenda
https://rayli.net/blog/data/history-of-data-mining/
Relationship to other Fields
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
+ Application Areas
Math
Relationship to other Fields
Method Learning Strategy
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Statistical learning: deals with the problem of finding a predictive function based
on data.
Tools: (Linear) classifiers, regression and regularization.
Relationship to other Fields
Method Learning Strategy
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Machine Learning involves the study of algorithms that can extract information
automatically, i.e., without on-line human guidance.
Techniques: Focus on supervised learning.
Relationship to other Fields
Method Learning Strategy
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
Data Mining: Manually analyze a given dataset to gain insights and predict potential
outcomes.
Techniques: Any applicable technique from databases, statistics, machine/statistical
learning. New methods were developed by the Data Mining community.
Data Mining & Analytics
12
OR
10
8
Data Mining / Stats
Column 1
6 Statistics Column 2
Column 3
4
OR
Machine Learning
2
0
Row 1 Row 2 Row 3 Row 4
DB / CS
Prescriptive Analytics
What decisions should we make now
to achieve the best future outcome?
Predictive
Data Model
Optimize
Issues:
- What are the decision variables? Causality?
- Relationship can be non-linear. Convex?
- Uncertainty about quality and reliability of the predictive model.
Data
Science
https://www.kdnuggets.com/polls/
Tools: Types
Simple
Process
graphical user
oriented
interface
Programming
oriented
Tools: Simple GUI
▪ Weka: Waikato
Environment for
Knowledge Analysis (Java
API)
▪ SAS Enterprise
Miner
▪ IBM SPSS
Modeler
▪ RapidMiner
▪ Knime
▪ Orange
Tools: Programming oriented
▪R
—Rattle for beginners
—RStudio IDE, markdown, shiny
—Microsoft Open R
▪ Python
—Numpy, scikit-learn, pandas
—Jupyter notebook
http://www.fulcrumlogic.com/data_warehousing.shtml
Data Warehouse
Operations:
⚫ Slice
Smart phones
⚫ Dice
⚫ Drill-down
Product ⚫ Roll-up
⚫ Pivot
TX
Region
Computation
Distributed
Computation
Distributed MapReduce
Agenda
?
Legal, Privacy and Security Issues
BERLIN — Angry Birds, the top-selling paid mobile app for the iPhone
in the United States and Europe, has been downloaded more than a
billion times by devoted game players around the world, who often
spend hours slinging squawking fowl at groups of egg-stealing pigs.
When Jason Hong, an associate professor at the Human-Computer
Interaction Institute at Carnegie Mellon University, surveyed 40 users,
all but two were unaware that the game was storing their locations so
that they could later be the targets of ads....
Here is what the small print says...
USA Today Network Josh Hafner, 2:38 p.m. EDT July 13, 2016