Data Warehousing & Data Mining Chapter 4
Data Mining
TBS 2020-2021
What is data mining
▪ We can have the following types of models
▪ Models that explain the data (e.g., a single
function)
▪ Models that predict the future data instances.
▪ Models that summarize the data
▪ Models that extract the most prominent
features of the data.
What is data mining
▪ Data mining is used today by companies with a strong
consumer focus - retail, financial, communication, and
marketing organizations.
The Knowledge Discovery in Databases (KDD)
The Knowledge Discovery in Databases process comprises
a few steps leading from raw data collections to some
form of new knowledge.
The iterative process consists of the following steps:
▪ Data cleaning: also known as data cleansing, it is a
phase in which noisy data and irrelevant data are
removed from the collection.
▪ Data integration: at this stage, multiple data sources,
often heterogeneous, may be combined in a common
source.
▪ Data selection: at this step, the data relevant to the
analysis is decided on and retrieved from the data
collection.
▪ Data transformation: also known as data
consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining
procedure.
▪ Data mining: it is the crucial step in which clever
techniques are applied to extract potentially useful
patterns.
▪ Pattern evaluation: in this step, strictly interesting
patterns representing knowledge are identified based
on given measures.
▪ Knowledge representation: the final phase, in which
the discovered knowledge is visually represented to
the user. This essential step uses visualization
techniques to help users understand and interpret the
data mining results.
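The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the attribute names (`customer`, `age`, `spend`), the thresholds, and the function names are all hypothetical and do not come from the slides, and the "mining" step is reduced to simple co-occurrence counting.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values invented for illustration).
store_a = [
    {"customer": "c1", "age": 23, "spend": 120.0},
    {"customer": "c2", "age": None, "spend": 80.0},   # missing age: noisy record
    {"customer": "c3", "age": 41, "spend": 310.0},
]
store_b = [
    {"customer": "c4", "age": 35, "spend": 95.0},
    {"customer": "c5", "age": 58, "spend": -5.0},     # negative spend: noisy record
    {"customer": "c6", "age": 29, "spend": 150.0},
]

def clean(records):
    """Data cleaning: drop records with missing or impossible values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def transform(records):
    """Data transformation: discretize numeric values into categories."""
    return [
        {
            "age_group": "young" if r["age"] < 30 else "older",
            "spend_level": "high" if r["spend"] >= 100 else "low",
        }
        for r in records
    ]

def mine(records):
    """Data mining (toy version): count co-occurring category pairs."""
    return Counter((r["age_group"], r["spend_level"]) for r in records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

def present(patterns):
    """Knowledge representation: render patterns as readable text lines."""
    return [
        f"{age} customers with {level} spend: {n}"
        for (age, level), n in sorted(patterns.items())
    ]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))))
report = present(patterns)
print(report)
```

In practice the process is iterative, as stated above: a weak result at the evaluation step usually sends the analyst back to cleaning, selection, or transformation.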
Steps of a KDD Process
▪ Learning the application domain:
– relevant prior knowledge and goals of application
▪ Creating a target data set: data selection
▪ Data cleaning and preprocessing: (may take 60% of effort!)
▪ Data reduction and transformation:
– Find useful features, dimensionality/variable reduction,
invariant representation.
▪ Choosing the data mining function(s)
– summarization, classification, regression, association, clustering.
▪ Choosing the mining algorithm(s)
▪ Data mining: search for patterns of interest
▪ Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
▪ Use of discovered knowledge
Main data mining tasks
▪ Classification:
mining patterns that can classify future data
into known classes.
▪ Association rule mining
mining any rule of the form X → Y, where X
and Y are sets of data items.
▪ Clustering
identifying a set of similarity groups in the data
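A rule X → Y is usually judged by two measures that the slides do not define here, so treat them as an assumption: support (the fraction of transactions containing both X and Y) and confidence (the fraction of transactions containing X that also contain Y). A minimal sketch on invented market-basket data:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Estimated share of transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical transactions (sets of data items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule {bread} → {milk} holds in 2 of the 4 transactions and in 2 of the 3 transactions that contain bread.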
Main data mining tasks
▪ Sequential pattern mining:
A sequential rule A → B says that event A
will be immediately followed by event B with
a certain confidence.
▪ Deviation detection:
discovering the most significant changes in
data
▪ Data visualization:
using graphical methods to show
patterns in data.
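The confidence of a sequential rule A → B can be estimated by counting how often A is immediately followed by B. The sketch below assumes event sequences stored as Python lists; the session data and the function name are hypothetical.

```python
def sequential_confidence(sequences, a, b):
    """Share of occurrences of event a that are immediately followed by event b."""
    followed = 0
    total = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                total += 1
                if i + 1 < len(seq) and seq[i + 1] == b:
                    followed += 1
    return followed / total if total else 0.0

# Hypothetical event sequences, one list per user session.
sessions = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy", "logout"],
]

conf_search_buy = sequential_confidence(sessions, "search", "buy")
print(conf_search_buy)
```

In this toy data "search" occurs three times and is immediately followed by "buy" twice.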
Why is data mining necessary?
▪ Make use of your data assets
▪ There is a big gap between stored data and
knowledge, and the transition won’t occur
automatically.
▪ Many interesting things you want to find
cannot be found using database queries:
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
Data mining applications
▪ Marketing
customer profiling and retention, identifying
potential customers, market segmentation.
▪ Fraud detection
identifying credit card fraud, intrusion detection
▪ Scientific data analysis
▪ Text and web mining
▪ Any application that involves a large
amount of data …
Data mining functions
▪ Association rules
▪ Sequence mining
▪ Classification (decision trees, etc.)
▪ Clustering
▪ Deviation detection
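As a rough illustration of deviation detection, the last function in the list above, one simple approach (chosen here as an assumption, not taken from the slides) flags values whose z-score, the distance from the mean measured in standard deviations, exceeds a threshold. The sales figures and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

def deviations(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales with one unusual day.
daily_sales = [100, 102, 98, 101, 99, 100, 180, 97]
flagged = deviations(daily_sales)
print(flagged)
```

A z-score rule is sensitive to the outliers it is trying to find, since they inflate the standard deviation; it is shown only because it is short.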
Data mining techniques
Many methods, such as
▪ Decision trees
▪ K-nearest neighbours
▪ Neural networks
▪ Genetic algorithms
▪ Hidden Markov models
▪ Time series
▪ Bayesian networks
▪ Rough and fuzzy sets
Predictive modeling
▪ A “black box” that makes predictions about
the future based on information from the
past and present
[Figure: black-box predictive model; labels: Age, CarType]
Models
• Some models are better than others
– Accuracy
– Understandability
• Models range from easy to understand to
incomprehensible
– Decision trees (easier)
– Rule induction
– Regression models
– Neural networks (harder)
Supervised vs. Unsupervised Learning
[Figure: Data Mining, with an unsupervised system and a supervised system]
Supervised vs. Unsupervised Learning
▪ Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
▪ Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
Supervised Learning
▪ Discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the
values of the target attribute in future data
instances.
▪ Supervised techniques
• Decision Tree
• Bayesian networks
• Classification rules
• Neural networks
Supervised Learning
The data and the goal
▪ Data: A set of data records (also called examples,
instances or cases) described by
– k attributes: A1, A2, … Ak.
– a class: Each example is labelled with a predefined class.
▪ Goal: To learn a classification model from the data
that can be used to predict the classes of new
(future, or test) cases/instances.
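The data and goal above can be made concrete with a k-nearest-neighbours classifier, one of the techniques listed earlier. This is a minimal sketch assuming two numeric attributes (A1, A2) and two predefined classes; the training records, test records, and the choice k = 3 are hypothetical.

```python
from collections import Counter
import math

def knn_predict(training, query, k=3):
    """Predict the class of query by majority vote among its k nearest examples."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Each training example: (attribute values (A1, A2), predefined class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"),
    ((7.0, 5.5), "high"),
    ((6.5, 7.0), "high"),
]

# New (test) cases with known labels, used only to measure accuracy.
test = [((1.2, 1.4), "low"), ((6.8, 6.1), "high")]

predictions = [knn_predict(training, attributes) for attributes, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Keeping the test cases separate from the training data is what makes the accuracy figure meaningful: it estimates how the model behaves on instances it has not seen.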
Supervised Learning
[Figure: object O; attributes (variables) A1, A2, …, AK; supervised method]
Unsupervised Learning
[Figure: object O; attributes (variables) A1, A2, …, AK; unsupervised method; measures; "????"; results]
Unsupervised Learning
▪ Unsupervised techniques
• Clustering
• Association rules
• Neural networks
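To illustrate clustering without class labels, here is a small sketch of k-means, a common clustering algorithm that the slides do not name, so its use here is an assumption. The points, k = 2, and the deterministic initialization (first k points as starting centroids) are hypothetical choices made to keep the example reproducible.

```python
import math

def kmeans(points, k, iterations=10):
    """Toy k-means: returns (centroids, cluster index for each point)."""
    centroids = list(points[:k])  # deterministic start: the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical unlabelled observations with two attributes each.
points = [(1.0, 1.0), (6.0, 6.0), (1.5, 2.0), (7.0, 5.5), (2.0, 1.0), (6.5, 7.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

No labels are given to the algorithm; the two groups emerge only from the distances between observations.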
Recap: Data mining process (KDD)
[Figure: Original Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge]
Recap: Data mining goals
▪ Prediction
– What? Opaque
▪ Description
– Why? Transparent
▪ Data mining vs. statistics
– Discover rather than check
▪ Data mining vs. machine learning
– Handling huge databases rather than a "small"
training set
“If you torture the data long enough, it will confess”
Ronald Coase, Nobel Prize in Economics, 1991