
Data Mining: Introduction

by Michael Hahsler
Agenda

▪ What is Data Mining?


▪ Data Mining Tasks
▪ Relationship to Statistics,
Optimization, Machine Learning
and AI
▪ Tools
▪ Data
▪ Legal, Privacy and Security Issues
What is Data Mining?

One of many definitions:

"Data mining is the science of extracting useful knowledge from huge data repositories."

ACM SIGKDD, Data Mining Curriculum: A Proposal

http://www.kdd.org/curriculum
Why Data Mining?
Commercial Viewpoint
▪ Businesses collect and warehouse lots
of data.
—Purchases at department/grocery stores
—Bank/credit card transactions
—Web and social media data
—Mobile and IoT
▪ Computers are cheaper and more
powerful.
▪ Competition to provide better services.
—Mass customization and recommendation
systems
—Targeted advertising
—Improved logistics
Why Mine Data?
Scientific Viewpoint
▪ Data collected and stored at
enormous speeds (GB/hour)
—remote sensors on a satellite
—telescopes scanning the skies
—microarrays generating gene
expression data
—scientific simulations
generating terabytes of data
▪ Data mining may help scientists
—identify patterns and relationships
—classify and segment data
—formulate hypotheses
Knowledge Discovery in Databases (KDD) Process

[Figure: the KDD process, annotated with: understand domain; noise/outliers; missing data; data normalization; feature engineering; feature selection; data/dimensionality reduction; decide on task & algorithm; performance?]

Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. From data mining to knowledge discovery: an overview.
CRISP-DM Reference Model

▪ Cross Industry Standard Process for Data Mining
▪ Open standard process model
▪ Industry, tool and application
neutral
▪ Defines tasks and outputs.
▪ Now developed by IBM as the
Analytics Solutions Unified
Method for Data
Mining/Predictive Analytics
(ASUM-DM).
▪ SAS has SEMMA and most
consulting companies use
their own similar process.

https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Tasks in the CRISP-DM Model
Data Mining Tasks

Descriptive Methods: Find human-interpretable patterns that describe the data.

Predictive Methods: Use some features (variables) to predict unknown or future values of another variable.
[Figure: Data Mining Tasks. Surviving labels: Regression, Classification]

Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006
Clustering
Group points such that
—Data points in one cluster are more similar to one another.
—Data points in separate clusters are less similar to one another.

Ideal grouping is not known → Unsupervised Learning


Intracluster distances are minimized; intercluster distances are maximized.

Euclidean distance based clustering in 3-D space.
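
The idea of grouping nearby points can be sketched with k-means, one common Euclidean-distance clustering method. The slides do not name a specific algorithm, so k-means, the sample points, and the choice of the first k points as starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance."""
    # Assumption: use the first k points as initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (9, 9, 9), (1, 2, 1), (2, 1, 1), (9, 8, 9), (8, 9, 9)]
clusters, centroids = kmeans(points, k=2)
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

Because no class labels are used, the result depends only on the distances between points, which is what makes this an unsupervised method.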


Clustering: Market Segmentation

Goal: Subdivide a market into distinct subsets of customers. Use a different marketing mix for each segment.

Approach:
1. Collect different attributes of customers based on their geographical and lifestyle related information and observed buying patterns.
2. Find clusters of similar customers.
Clustering Documents

Goal: Find groups of documents that are similar to each other.

Approach: Identify frequently occurring terms in each document. Define a similarity measure based on term co-occurrences. Use it to cluster.

Gain: Can be used to organize documents or to create recommendations.
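
One simple way to turn shared terms into a similarity score is the Jaccard coefficient over each document's term set. This is a minimal sketch and not necessarily the measure the slides have in mind; the sample documents and the `terms` and `jaccard` helpers are made up for illustration.

```python
def terms(document):
    """Return the set of lowercase terms in a document."""
    return set(document.lower().split())

def jaccard(doc_a, doc_b):
    """Shared terms divided by all distinct terms of the two documents."""
    a, b = terms(doc_a), terms(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical documents.
docs = [
    "data mining finds patterns",
    "data mining finds rules",
    "the cat sat",
]
print(jaccard(docs[0], docs[1]))  # 3 shared terms out of 5 distinct terms
print(jaccard(docs[0], docs[2]))  # no shared terms
```

A clustering method can then group documents whose pairwise similarity is high.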
Clustering: Data Reduction

Goal: Reduce the data size for predictive models.

Approach: Group data given a subset of the available information and then use the group label instead of the original data as input for predictive models.
Association Rule Discovery

▪ Given is a set of transactions. Each contains a number of items.
▪ Produce dependency rules of the form
LHS → RHS
which indicate that if the set of items in the LHS is in a transaction, then the transaction will likely also contain the RHS item.

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Discovered rules:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
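
How strongly the five transactions back the two discovered rules can be checked with support and confidence. These are the standard rule-quality measures, but the slides do not name them here, so treat them and the helper functions below as an illustrative assumption.

```python
# Transactions copied from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} → {Coke}, three of the four transactions containing Milk also contain Coke, so the confidence is 0.75.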


Association Rule Discovery: Marketing and Sales Promotion

▪ Let the rule discovered be
{Potato Chips, … } → {Soft drink}
▪ Soft drink as RHS: What should be done to boost sales? Discount Potato Chips?
▪ Potato Chips in LHS: Shows which products would be affected if the store discontinues selling Potato Chips.
▪ Potato Chips in LHS and Soft drink in RHS: What products should be sold with Potato Chips to promote sales of Soft drinks?
Association Rule Discovery: Supermarket Shelf Management
▪ Goal: To identify items that are bought together by sufficiently many customers.
▪ Approach:
—Process the point-of-sale data to find dependencies among items.
—Place dependent items
• close to each other (convenience).
• far from each other to expose the customer to the maximum number of products in the store.
Association Rule Discovery: Inventory Management
▪ Goal: Anticipate the nature of repairs to keep the service vehicles equipped with the right parts to speed up repair time.
▪ Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover co-occurrence patterns.
Regression
▪ Predict a value of a given
continuous valued variable
based on the values of other
variables, assuming a linear or
nonlinear model of
dependency.
▪ Studied in statistics and
econometrics.

Applications:
▪ Predicting sales amounts of new product based on advertising
expenditure.
▪ Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
▪ Time series prediction of stock market indices (autoregressive models).
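
A linear model of dependency can be fitted with ordinary least squares. The sketch below does this for one predictor; the advertising and sales figures are invented for illustration and `fit_line` is a hypothetical helper, not part of the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # prediction for a new expenditure level
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```

The fitted line predicts a continuous value, which is what separates regression from classification.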
Classification
Find a model for the class attribute as a function of the values of other
attributes/features.

Class information is available → Supervised Learning

Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Training Set → Learn Classifier → Model
Goal: assign new records to a class as accurately as possible.

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the model learned from the training set is applied to the test set.]
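
To make the train/apply workflow concrete, the sketch below uses the ten training records and six test records from the slides. The `classify` function is a small hand-written decision tree standing in for a learned model; its splits (including the 80K income threshold) are an assumption chosen to fit the training data, not the output of any learning algorithm in the slides.

```python
# (refund, marital_status, taxable_income_in_k, cheat) from the training set.
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (refund, marital_status, taxable_income_in_k) from the test set; class unknown.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def classify(refund, marital_status, income_k):
    """Hypothetical decision tree: refund, then marital status, then income."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

correct = sum(classify(r, m, i) == label for r, m, i, label in training_set)
training_accuracy = correct / len(training_set)
predictions = [classify(*record) for record in test_set]
print(training_accuracy, predictions)
```

Accuracy on the training set only shows that the model fits the data it was built from; how well it does on new records has to be measured on data with known labels that was held back.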
Classification: Direct Marketing
▪ Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new product.
▪ Approach:
— Use the data for a similar product introduced before or from a focus group. We have customer information (e.g., demographics, lifestyle, previous purchases) and know which customers decided to buy and which decided otherwise. This buy/don’t buy decision forms the class attribute.
— Use this information as input attributes to learn a classifier model.
— Apply the model to new customers to predict if they will buy the product.
Classification: Customer Attrition/Churn
▪ Goal: To predict whether a customer is likely to be lost to a competitor.
▪ Approach:
—Use detailed record of transactions with each of the past and present customers, to find attributes (frequency, recency, complaints, demographics, etc.).
—Label the customers as loyal or disloyal.
—Find a model for disloyalty.
—Rank each customer on a loyal/disloyal scale (e.g., churn probability).
Classification: Sky Survey Cataloging
▪ Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
▪ Approach:
—Segment the image to identify objects.
—Derive features per object (40 features).
—Use known objects to model the class based on these features.
▪ Result: Found 16 new high red-shift quasars.

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Deviation/Anomaly Detection

▪ Detect significant deviations from normal behavior.
▪ Applications:
—Credit Card Fraud Detection
—Network Intrusion Detection

Typical network traffic at university level may reach over 100 million connections per day.
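
A very simple way to flag deviations from normal behavior is a z-score rule: mark values that lie many standard deviations from the mean. This is only one of many possible techniques and is not named in the slides; the transaction amounts and the 2.5 threshold are assumptions for illustration.

```python
from statistics import mean, pstdev

def flag_anomalies(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical credit card transaction amounts with one unusual purchase.
amounts = [20, 22, 19, 21, 23, 20, 18, 24, 21, 500]
flagged = flag_anomalies(amounts)
print(flagged)
```

Note that extreme values inflate the mean and standard deviation themselves, so practical systems often prefer more robust statistics or model-based methods.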
Other Data Mining Tasks

▪ Text mining – document clustering, topic models
▪ Data stream mining/real time data mining
▪ Graph mining – social networks
▪ Mining spatiotemporal data (e.g., moving objects)
▪ Visual data mining
▪ Distributed data mining
Challenges of Data Mining

▪ Scalability
▪ Dimensionality
▪ Data ownership and privacy
▪ Complexity and heterogeneous data
▪ Data quality
Origins of Data Mining
▪ Draws ideas from AI, machine learning, pattern recognition, statistics, and database systems.
▪ There are differences in terms of
—used data and
—the goals.

[Figure: history of data mining. Surviving labels: AI; Machine Learning (1959-); Chief Data Scientist, White House]

https://rayli.net/blog/data/history-of-data-mining/
Relationship to other Fields

[Figure: related fields and methods (Artificial Intelligence, Machine Learning, Statistical Learning, Data Mining, Optimization, Statistics, Math, and application areas) and learning strategies (supervised, unsupervised, reinforcement, and online learning)]

Artificial Intelligence: Create an autonomous agent that perceives its environment and takes actions that maximize its chance of reaching some goal.
Areas: reasoning, knowledge representation, planning, learning, natural language
processing, and vision.

Optimization: Selection of a best alternative from some set of available alternatives with regard to some criterion.
Techniques: Linear programming, integer programming, nonlinear programming,
stochastic and robust optimization, heuristics, etc.

Statistics: Study of the collection, analysis, interpretation, presentation, and organization of data.
Techniques: Descriptive statistics, statistical inference (estimation, testing), design of
experiments.

Learning Strategy: From what data do we learn?


⚫ Is a training set with correct answers available? → Supervised learning
⚫ Long-term structure of rewards? → Reinforcement learning
⚫ No answer and no reward structure? → Unsupervised learning
⚫ Do we have to update the model regularly? → Online learning

Statistical learning: deals with the problem of finding a predictive function based
on data.
Tools: (Linear) classifiers, regression and regularization.

Machine Learning involves the study of algorithms that can extract information
automatically, i.e., without on-line human guidance.
Techniques: Focus on supervised learning.

Data Mining: Manually analyze a given dataset to gain insights and predict potential
outcomes.
Techniques: Any applicable technique from databases, statistics, machine/statistical
learning. New methods were developed by the Data Mining community.
Data Mining & Analytics

[Figure: chart positioning OR, Statistics, Data Mining / Stats, Machine Learning, and DB / CS]
Prescriptive Analytics
What decisions should we make now
to achieve the best future outcome?

[Figure: Data → Predictive Model → Decision, with the labels "Predict what will happen in the future", "Predict what will change predicted outcomes", "Evaluate", and "Optimize"]

Issues:
- What are the decision variables? Causality?
- Relationship can be non-linear. Convex?
- Uncertainty about quality and reliability of the predictive model.
Data Science

Source: T. Stadelmann, et al., Applied Data Science in Europe

Good luck finding this person! Probably a team effort!
Tools: Commercial Players

Gartner Magic Quadrant (MQ) for Data Science and Machine Learning Platforms, 2020 vs. 2019 changes.
Tools: Popularity

https://www.kdnuggets.com/polls/
Tools: Types

▪ Simple graphical user interface
▪ Process oriented
▪ Programming oriented
Tools: Simple GUI

▪ Weka: Waikato Environment for Knowledge Analysis (Java API)
▪ Rattle: GUI for Data Mining using R
Tools: Process oriented

▪ SAS Enterprise Miner
▪ IBM SPSS Modeler
▪ RapidMiner
▪ KNIME
▪ Orange
Tools: Programming oriented

▪ R
—Rattle for beginners
—RStudio IDE, markdown, shiny
—Microsoft R Open

▪ Python
—NumPy, scikit-learn, pandas
—Jupyter notebook

→ Both have similar capabilities. Slightly different focus:
—R: statistical computing and visualization
—Python: Scripting, big data
—Interoperability via rpy2 and reticulate
https://www.dataquest.io/blog/python-vs-r/
Data
Data Warehouse

http://www.fulcrumlogic.com/data_warehousing.shtml
Data Warehouse

▪ Subject Oriented: Data warehouses are designed to help you analyze data (e.g., sales data is organized by product and customer).
▪ Integrated: Integrates data from disparate sources into a consistent
format.
▪ Nonvolatile: Data in the data warehouse are never overwritten or
deleted.
▪ Time Variant: maintains both historical and (nearly) current data.
ETL: Extract, Transform and Load

▪ Extracting data from outside sources
▪ Transforming data to fit analytical needs. E.g.,
—Clean missing data, wrong data, etc.
—Normalize and translate (e.g., 1 → "female")
—Join from several sources
—Calculate and aggregate data
▪ Loading data into the data warehouse

Source: SAS, ETL: What it is and why it matters
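
The extract, transform, and load steps can be sketched in a few lines. Everything below is hypothetical: the two source tables, the field names, and the in-memory list standing in for the warehouse. Only the mapping 1 → "female" comes from the slides; the code for "male" is an assumption.

```python
from collections import defaultdict

# Extract: hypothetical rows pulled from two outside sources.
customers = [
    {"id": 1, "gender": 1},
    {"id": 2, "gender": 2},
    {"id": 3, "gender": None},  # missing value to clean
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]

# 1 -> "female" is from the slides; 2 -> "male" is an assumed code.
GENDER_CODES = {1: "female", 2: "male"}

def transform(customers, purchases):
    """Clean, translate codes, join the sources, and aggregate spending."""
    totals = defaultdict(float)
    for purchase in purchases:
        totals[purchase["customer_id"]] += purchase["amount"]
    return [
        {
            "id": c["id"],
            "gender": GENDER_CODES.get(c["gender"], "unknown"),
            "total_spent": totals.get(c["id"], 0.0),
        }
        for c in customers
    ]

def load(rows, table):
    """Append the transformed rows to the (in-memory) warehouse table."""
    table.extend(rows)

warehouse = []  # stand-in for a data warehouse table
load(transform(customers, purchases), warehouse)
for row in warehouse:
    print(row)
```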
OnLine Analytical Processing (OLAP)

Operations:
⚫ Slice
⚫ Dice
⚫ Drill-down
⚫ Roll-up
⚫ Pivot

[Figure: data cube with dimensions such as Product (e.g., Smart phones) and Region (e.g., TX)]

Store data in "data cubes" for fast OLAP operations.
Requires a special database structure (snowflake schema).
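
Three of the OLAP operations can be illustrated on a tiny cube held in a dictionary. This is a sketch only: real OLAP systems use dedicated storage, the sales figures are invented, and the time dimension (quarter) is an assumption added so that there is something to roll up. Drill-down and pivot are left out.

```python
from collections import defaultdict

# Hypothetical cube: (product, region, quarter) -> units sold.
cube = {
    ("Smart phones", "TX", "Q1"): 10,
    ("Smart phones", "TX", "Q2"): 12,
    ("Smart phones", "CA", "Q1"): 7,
    ("Laptops", "TX", "Q1"): 4,
    ("Laptops", "CA", "Q2"): 6,
}

def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value."""
    return {key: v for key, v in cube.items() if key[axis] == value}

def dice_cube(cube, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets per dimension."""
    return {
        key: v for key, v in cube.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }

def roll_up(cube, axis):
    """Roll-up: aggregate one dimension away by summing over it."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[key[:axis] + key[axis + 1:]] += value
    return dict(totals)

print(slice_cube(cube, 1, "TX"))                           # only region TX
print(dice_cube(cube, {0: {"Smart phones"}, 2: {"Q1"}}))   # sub-cube
print(roll_up(cube, 2))                                    # totals per product and region
```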
Big Data
▪ "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them." (Wikipedia)
▪ 3 V's: Volume, velocity, variety, (veracity) (Gartner)

[Figure labels: Computation; Distributed Computation; Distributed MapReduce]
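
The MapReduce idea can be sketched on a single machine with the classic word count: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. In a real framework these steps run in parallel across many machines; the function names and sample documents here are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

# Hypothetical input documents; each could live on a different machine.
documents = ["big data is big", "data mining"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```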
Legal, Privacy and Security Issues


▪ Are we allowed to collect the data?
▪ Are we allowed to use the data?
▪ Is privacy preserved in the process?
▪ Is it ethical to use and act on the data?

Problem: The Internet is global, but legislation is local!


Data-Gathering via Apps Presents a Gray Legal Area
By KEVIN J. O’BRIEN
Published: October 28, 2012

BERLIN — Angry Birds, the top-selling paid mobile app for the iPhone
in the United States and Europe, has been downloaded more than a
billion times by devoted game players around the world, who often
spend hours slinging squawking fowl at groups of egg-stealing pigs.
When Jason Hong, an associate professor at the Human-Computer
Interaction Institute at Carnegie Mellon University, surveyed 40 users,
all but two were unaware that the game was storing their locations so
that they could later be the targets of ads....
Here is what the small print says...
USA Today Network Josh Hafner, 2:38 p.m. EDT July 13, 2016

Pokémon Go’s constant location tracking and camera access required
for gameplay, paired with its skyrocketing popularity, could provide
data like no app before it.
“Their privacy policy is vague,” Hong said. “I’d say deliberately vague,
because of the lack of clarity on the business model.”
...
The agreement says Pokémon Go collects data about its users as a
“business asset.” This includes data used to personally identify players
such as email addresses and other information pulled from Google and
Facebook accounts players use to sign up for the game.
If Niantic is ever sold, the agreement states, all that data can go to
another company.
Conclusion

Data Mining is interdisciplinary and overlaps significantly with many fields
• Statistics
• CS (machine learning, AI, databases)
• Optimization

Data Mining requires a team effort with members who have expertise in several areas
• Data management
• Statistics
• Programming
• Communication
• + Application domain
