Chap1 Intro

Data Mining: Introduction
by Michael Hahsler
Agenda
▪ What is Data Mining?

▪ Data Mining Tasks
▪ Relationship to Statistics,
Optimization, Machine Learning
and AI
▪ Tools
▪ Data
▪ Legal, Privacy and Security Issues
What is Data Mining?
One of many definitions:
"Data mining is the science of extracting useful

knowledge from huge data repositories."
ACM SIGKDD, Data Mining Curriculum: A Proposal
http://www.kdd.org/curriculum
Why Data Mining?
Commercial Viewpoint
▪ Businesses collect and warehouse lots
of data.
—Purchases at department/grocery stores
—Bank/credit card transactions
—Web and social media data
—Mobile and IOT
▪ Computers are cheaper and more
powerful.
▪ Competition to provide better services.
—Mass customization and recommendation
systems
—Targeted advertising
—Improved logistics
Why Mine Data?
Scientific Viewpoint
▪ Data collected and stored at
enormous speeds (GB/hour)
—remote sensors on a satellite
—telescopes scanning the skies
—microarrays generating gene
expression data
—scientific simulations
generating terabytes of data
▪ Data mining may help scientists
—identify patterns and relationships
—to classify and segment data
—formulate hypotheses
Knowledge Discovery in Databases (KDD) Process
Data normalization Decide on task & algorithm

Noise/outliers Performance?
Missing data Data/dim. reduction
Understand domain Features engineering
Feature selection
Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. From data mining to knowledge discovery: an overview.
CRISP-DM Reference Model
▪ Cross Industry Standard

Process for Data Mining
▪ Open standard process model
▪ Industry, tool and application
neutral
▪ Defines tasks and outputs.
▪ Now developed by IBM as the
Analytics Solutions Unified
Method for Data
Mining/Predictive Analytics
(ASUM-DM).
▪ SAS has SEMMA and most
consulting companies use
their own similar process.
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Tasks in the CRISP-DM Model
Agenda

and AI
▪ Tools
▪ Data
Data Mining Tasks
Descriptive Find human-interpretable

patterns that describe the
Methods data.
Predictive Use some features (variables)

to predict unknown or future
Methods value of other variable.
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Clustering
Group points such that
—Data points in one cluster are more similar to one another.
—Data points in separate clusters are less similar to one another.
Ideal grouping is not known → Unsupervised Learning

Intracluster distances Intercluster distances
are minimized are maximized
Euclidean distance based clustering in 3-D space.

Clustering: Market Segmentation
Goal: subdivide a market into Approach:

distinct subsets of customers. 1. Collect different attributes of
Use a different marketing mix customers based on their geographical
for each segment. and lifestyle related information and
observed buying patterns.
2. Find clusters of similar customers.
Clustering Documents
Goal: Find groups of Approach: Identify Gain: Can be used to

documents that are similar to frequently occurring terms in organize documents or to
each. each document. Define a create recommendations.
similarity measure based on
term co-occurrences. Use it
to cluster.
Clustering: Data Reduction
Goal: Reduce the data size for predictive Approach: Group data given a subset of
models. the available information and then use
the group label instead of the original
data as input for predictive models.
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Association Rule Discovery
▪ Given is a set of transactions. Each contains a

number of items.
▪ Produce dependency rules of the form
LHS → RHS
▪ which indicate that if the set of items in the LHS
are in a transaction, then the transaction likely will
also contain the RHS item.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread {Milk} → {Coke}
3 Beer, Coke, Diaper, Milk {Diaper, Milk} → {Beer}
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Transaction data Discovered Rules

Association Rule
Discovery
Marketing and
Sales Promotion
▪ Let the rule discovered be
{Potato Chips, … } → {Soft drink}
▪ Soft drink as RHS: What should be done

to boost sales? Discount Potato Chips?
▪ Potato Chips in LHS: Shows which
products would be affected if the store
discontinues selling Potato Chips.
▪ Potato Chips in LHS and Soft drink in
RHS: What products should be sold with
Potato Chips to promote sales of Soft
drinks!
▪ Goal: To identify items
Association that are bought together by
sufficiently many customers.
Rule Discovery ▪ Approach:
—Process the point-of-sale data
Supermarket to find dependencies among items.
shelf —Place dependent items
• close to each other (convenience).
management • far from each other to expose the customer to the maximum
number of products in the store.
Association
Rule ▪ Goal: Anticipate the nature of repairs to keep the service vehicles
equipped with right parts to speed up repair time.
Discovery ▪ Approach: Process the data on tools and parts required in
previous repairs at different consumer locations and discover co-
Inventory occurrence patterns.
Management
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Regression
▪ Predict a value of a given
continuous valued variable
based on the values of other
variables, assuming a linear or
nonlinear model of
dependency.
▪ Studied in statistics and
econometrics.
Applications:
▪ Predicting sales amounts of new product based on advertising
expenditure.
▪ Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
▪ Time series prediction of stock market indices (autoregressive models).
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Classification
Find a model for the class attribute as a function of the values of other
attributes/features.
Class information is available → Supervised Learning
Tid Refund Marital Taxable

Cheat
Status Income
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10
10 No Single 90K Yes
Training
Learn
Model
Set Classifier
Classification
Find a model for the class attribute as a function of the values of other
attributes/features.
Goal: assign new records to a class as accurately as possible.
Refund Marital Taxable

Cheat
Status Income
No Single 75K ?
Tid Refund Marital Taxable
Cheat Yes Married 50K ?
Status Income
No Married 150K ?
1 Yes Single 125K No
2 No Married 100K No Yes Divorced 90K ?
3 No Single 70K No No Single 40K ?
4 Yes Married 120K No No Married 80K ?
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes Test Set
Training
Learn
Model
Set Classifier
▪ Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new
product.
Classification: ▪ Approach:
— Use the data for a similar product introduced before or from a focus group.
Direct We have customer information (e.g., demographics, lifestyle, previous
purchases) and know which customers decided to buy and which decided
Marketing otherwise. This buy/don’t buy decision forms the class attribute.
— Use this information as input attributes to learn a classifier model.
— Apply the model to new customers to predict if they will buy the product.
▪ Goal: To predict whether a customer is likely to be lost to a
competitor.
▪ Approach:
Classification: —Use detailed record of transactions with each of the past and
present customers, to find attributes (frequency, recency,
Customer complaints, demographics, etc.).
Attrition/Churn —Label the customers as loyal or disloyal.
—Find a model for disloyalty.
—Rank each customer on a loyal/disloyal scale (e.g., churn
probability).
Classification: Sky
Survey Cataloging
▪ Goal: To predict class (star or
galaxy) of sky objects, especially
visually faint ones, based on the
telescopic survey images (from
Palomar Observatory).
▪ Approach:
—Segment the image to identify
objects.
—Derive features per object (40).
—Use known objects to model
the class based on these
features.
▪ Result: Found 16 new high
red-shift quasars.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
+ + ++
Data Mining Tasks +
+ + + ++
+ +
+
+
Regression
Classification
Deviation/Anomaly Detection
▪ Detect significant deviations from normal behavior.
▪ Applications:
—Credit Card Fraud Detection
—Network Intrusion
Detection
Typical network traffic at University

level may reach over 100 million
connections per day
Other Data Mining Tasks
Text mining –
Data stream
document Graph mining –
mining/real time
clustering, topic social networks
data mining
models
Mining
spatiotemporal Distributed data
Visual data mining
data (e.g., moving mining
objects)
Challenges of Data Mining
Scalability
Data
ownership and Dimensionality
privacy
Complexity
and
Data quality
heterogeneous
data
Agenda

and AI
▪ Tools
▪ Data
Origins of
Data Mining
▪ Draws ideas from AI,
machine learning, pattern
recognition, statistics, and AI
database systems.
Machine Learning (1959-)
▪ There are differences in
terms of
—used data and
—the goals.
Chief Data Scientist,

White House
https://rayli.net/blog/data/history-of-data-mining/
Relationship to other Fields
Method Learning Strategy
Artificial Supervised
Intelligence Learning
Machine Unsupervised
Optimization Learning Learning
Statistical
Data Learning
Reinforcement
Statistics Mining Learning
Online
Learning
+ Application Areas
Math
Statistical
Data Learning
Reinforcement
Online
Learning
Artificial Intelligence: Create an autonomous agent that perceives its environment

and takes actions that maximize its chance of reaching some goal.
Areas: reasoning, knowledge representation, planning, learning, natural language
processing, and vision.
Statistical
Data Learning
Reinforcement
Online
Learning
Optimization: Selection of a best alternative from some set of available alternatives

with regard to some criterion.
Techniques: Linear programming, integer programming, nonlinear programming,
stochastic and robust optimization, heuristics, etc.
Statistical
Data Learning
Reinforcement
Online
Learning
Statistics: Study of the collection, analysis, interpretation, presentation, and

organization of data.
Techniques: Descriptive statistics, statistical inference (estimation, testing), design of
experiments.
Statistical
Data Learning
Reinforcement
Online
Learning
Learning Strategy: From what data do we learn?

⚫ Is a training set with correct answers available? → Supervised learning
⚫ Long-term structure of rewards? → Reinforcement learning
⚫ No answer and no reward structure? → Unsupervised learning
⚫ Do we have to update the model regularly? → Online learning
Statistical
Data Learning
Reinforcement
Online
Learning
Statistical learning: deals with the problem of finding a predictive function based
on data.
Tools: (Linear) classifiers, regression and regularization.
Statistical
Data Learning
Reinforcement
Online
Learning
Machine Learning involves the study of algorithms that can extract information
automatically, i.e., without on-line human guidance.
Techniques: Focus on supervised learning.
Statistical
Data Learning
Reinforcement
Online
Learning
Data Mining: Manually analyze a given dataset to gain insights and predict potential
outcomes.
Techniques: Any applicable technique from databases, statistics, machine/statistical
learning. New methods were developed by the Data Mining community.
Data Mining & Analytics
12
OR
10
8
Data Mining / Stats
Column 1
6 Statistics Column 2
Column 3
4
OR
Machine Learning
2
0
Row 1 Row 2 Row 3 Row 4
DB / CS
Prescriptive Analytics
What decisions should we make now
to achieve the best future outcome?
Predictive
Data Model
Predict what Evaluate

Predict what Predict what
will happen
will happen Decision will change
predicted
in the future outcomes
Optimize
Issues:
- What are the decision variables? Causality?
- Relationship can be non-linear. Convex?
- Uncertainty about quality and reliability of the predictive model.
Data
Science
Source: T. Stadelmann, et al., Applied Data Science in Europe
Good luck finding this person!

Probably a team effort!
Agenda

and AI
▪ Tools
▪ Data
Tools: Commercial Players
Gartner MQ for Data

Science and Machine
Learning Platforms,
2020 vs 2019
changes.
Tools: Popularity
https://www.kdnuggets.com/polls/
Tools: Types
Simple
Process
graphical user
oriented
interface
Programming
oriented
Tools: Simple GUI
▪ Weka: Waikato
Environment for
Knowledge Analysis (Java
API)
▪ Rattle: GUI for Data

Mining using R
Tools: Process oriented
▪ SAS Enterprise
Miner
▪ IBM SPSS
Modeler
▪ RapidMiner
▪ Knime
▪ Orange
Tools: Programming oriented
▪R
—Rattle for beginners
—RStudio IDE, markdown, shiny
—Microsoft Open R
▪ Python
—Numpy, scikit-learn, pandas
—Jupyter notebook
→ Both have similar capabilities. Slightly different focus:

—R: statistical computing and visualization
—Python: Scripting, big data
—Interoperability via rpy2 and rediculate
https://www.dataquest.io/blog/python-vs-r/
Agenda

and AI
▪ Tools
▪ Data
Data
Data Warehouse
http://www.fulcrumlogic.com/data_warehousing.shtml
Data Warehouse
▪ Subject Oriented: Data warehouses are designed to help you analyze

data (e.g., sales data is organized by product and customer).
▪ Integrated: Integrates data from disparate sources into a consistent
format.
▪ Nonvolatile: Data in the data warehouse are never overwritten or
deleted.
▪ Time Variant: maintains both historical and (nearly) current data.
ETL: Extract, Transform and Load
▪ Extracting data from outside

sources
▪ Transforming data to fit
analytical needs. E.g.,
—Clean missing data, wrong data,
etc.
—Normalize and translate
(e.g., 1 → "female")
—Join from several sources
Source: SAS, ETL: What it is and why it matters
—Calculate and aggregate data
▪ Loading data into the data
warehouse
OnLine Analytical Processing (OLAP)
Operations:
⚫ Slice
Smart phones
⚫ Dice
⚫ Drill-down
Product ⚫ Roll-up
⚫ Pivot
TX
Region
Store data in "data cubes" for fast OLAP operations.

Requires a special database structure (Snow-flake scheme).
Big Data
▪ "Big data is a term for data sets
that are so large or complex that
traditional data processing applications are inadequate to deal with
them." Wikipedia
▪ 3 V's: Volume, velocity, variety, (veracity) Gartner
Computation
Distributed
Computation
Distributed MapReduce
Agenda

and AI
▪ Tools
▪ Data
Legal, Privacy and Security Issues
?
Are we Are we Is privacy Is it ethical

allowed to allowed to preserved to use and
collect the use the in the act on the
data? data? process? data?
Problem: Internet is global, but legislation is local!

Data-Gathering via Apps

Presents a Gray Legal Area
By KEVIN J. O’BRIEN
Published: October 28, 2012
BERLIN — Angry Birds, the top-selling paid mobile app for the iPhone
in the United States and Europe, has been downloaded more than a
billion times by devoted game players around the world, who often
spend hours slinging squawking fowl at groups of egg-stealing pigs.
When Jason Hong, an associate professor at the Human-Computer
Interaction Institute at Carnegie Mellon University, surveyed 40 users,
all but two were unaware that the game was storing their locations so
that they could later be the targets of ads....
Here is what the small print says...
USA Today Network Josh Hafner, 2:38 p.m. EDT July 13, 2016
Pokémon Go’s constant location tracking and camera access required

for gameplay, paired with its skyrocketing popularity, could provide
data like no app before it.
“Their privacy policy is vague,” Hong said. “I’d say deliberately vague,
because of the lack of clarity on the business model.”
...
The agreement says Pokémon Go collects data about its users as a
“business asset.” This includes data used to personally identify players
such as email addresses and other information pulled from Google and
Facebook accounts players use to sign up for the game.
If Niantic is ever sold, the agreement states, all that data can go to
another company.
Conclusion
Data Mining is Data Mining requires a team

interdisciplinary and effort with members who
overlaps significantly with have expertise in several
many fields areas
• Statistics • Data management
• CS (machine learning, AI, • Statistics
data bases) • Programming
• Optimization • Communication
• + Application domain

Chap1 Intro

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chap1 Intro

Uploaded by

Copyright:

Available Formats

Data Mining: Introduction

▪ What is Data Mining?

One of many definitions:

"Data mining is the science of extracting useful

ACM SIGKDD, Data Mining Curriculum: A Proposal

Data normalization Decide on task & algorithm

▪ Cross Industry Standard

▪ What is Data Mining?

Descriptive Find human-interpretable

Predictive Use some features (variables)

Ideal grouping is not known → Unsupervised Learning

Euclidean distance based clustering in 3-D space.

Goal: subdivide a market into Approach:

Goal: Find groups of Approach: Identify Gain: Can be used to

▪ Given is a set of transactions. Each contains a

Transaction data Discovered Rules

▪ Let the rule discovered be

{Potato Chips, … } → {Soft drink}

▪ Soft drink as RHS: What should be done

Class information is available → Supervised Learning

Tid Refund Marital Taxable

Goal: assign new records to a class as accurately as possible.

Refund Marital Taxable

▪ Detect significant deviations from normal behavior.

Typical network traffic at University

▪ What is Data Mining?

Chief Data Scientist,

Method Learning Strategy

Artificial Intelligence: Create an autonomous agent that perceives its environment

Optimization: Selection of a best alternative from some set of available alternatives

Statistics: Study of the collection, analysis, interpretation, presentation, and

Learning Strategy: From what data do we learn?

Predict what Evaluate

Source: T. Stadelmann, et al., Applied Data Science in Europe

Good luck finding this person!

▪ What is Data Mining?

Gartner MQ for Data

▪ Rattle: GUI for Data

→ Both have similar capabilities. Slightly different focus:

▪ What is Data Mining?

▪ Subject Oriented: Data warehouses are designed to help you analyze

▪ Extracting data from outside

Store data in "data cubes" for fast OLAP operations.

▪ What is Data Mining?

Are we Are we Is privacy Is it ethical

Problem: Internet is global, but legislation is local!

Data-Gathering via Apps

Pokémon Go’s constant location tracking and camera access required

Data Mining is Data Mining requires a team

You might also like