Data Warehouse and Data Mining: Syllabus
UNIT I
Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining Systems, Data Mining Task Primitives, Integration of a Data Mining System with a Database or a Data Warehouse System, Major Issues in Data Mining.
Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.
UNIT II
Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining.
Data Cube Computation and Data Generalization: Efficient Methods for Data Cube Computation, Further Development of Data Cube and OLAP Technology, Attribute-Oriented Induction.
About the Unit
Introduction
Fundamentals of data mining
Data Mining Functionalities
Classification of Data Mining systems
What Is Data Mining?
Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"? Simple search and query processing, and (deductive) expert systems, are not.

Data Mining Activity or Not?
Dividing the products of an item according to their brand.
Dividing the products of an item according to their profitability.
Computing the total sales of a company.
Sorting a student database based on student identification numbers.
Predicting the future stock price of a company using historical records.
Predicting the sales of an item using historical records.
Monitoring the heart rate of a patient for abnormalities.

The Explosive Growth of Data: from Terabytes to Petabytes
Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society.
Major sources of abundant data:
Business: Web, e-commerce, transactions, stocks, ...
Science: remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, YouTube.
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets.

Data Analysis and Decision Support
Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation.
Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
Fraud detection and detection of unusual patterns (outliers).
Other applications: text mining (newsgroups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.
Ex. 1: Market Analysis and Management
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
Target marketing: find clusters of "model" customers who share the same characteristics (interests, income level, spending habits, etc.); determine customer purchasing patterns over time.
Cross-market analysis: find associations/correlations between product sales, and predict based on such associations.
Customer profiling: what types of customers buy what products (clustering or classification).
Customer requirement analysis: identify the best products for different groups of customers; predict what factors will attract new customers.

Ex. 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation: cash flow analysis and prediction; contingent claim analysis to evaluate assets; cross-sectional and time-series analysis (financial-ratio analysis, trend analysis, etc.).
Resource planning: summarize and compare the resources and spending.
Competition: monitor competitors and market directions; group customers into classes and apply a class-based pricing procedure; set pricing strategy in a highly competitive market.
Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: clustering and model construction for frauds, outlier analysis.
Applications: health care, retail, credit card services, telecommunications.
Auto insurance: rings of collisions.
Money laundering: suspicious monetary transactions.
Medical insurance: "professional" patients, rings of doctors, and rings of references; unnecessary or correlated screening tests.
Telecommunications: phone-call fraud. Build a phone-call model (destination of the call, duration, time of day or week) and analyze patterns that deviate from the expected norm.
Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees.
Anti-terrorism.

Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process. The steps of the process (shown bottom-to-top in the figure):
Databases → Data Cleaning → Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Transformation & Data Reduction → Data Mining → Pattern Evaluation → Knowledge Presentation

Prepared by S.Palaniappan, Assoc.Prof, GRIET
Architecture of a Typical Data Mining System
[Figure: layered architecture of a typical data mining system, topped by a graphical user interface.]

Data Mining: On What Kinds of Data?
In principle, data mining should be applicable to any kind of data repository, as well as to transient data such as data streams:
Relational and object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases.
Data warehouses (centralized data management and retrieval).
Transactional databases: consist of a file where each record represents a transaction; a transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as the items purchased in a store).
Time-series databases: store sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly); examples include stock exchange data, inventory control, and the observation of natural phenomena; mining can perform inference on the current data in order to make predictions (classification, regression).
Spatial databases and spatiotemporal databases.
Data Mining Functionalities

Multidimensional concept description: characterization and discrimination
Data can be associated with classes or concepts. For example, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. Descriptions of a class or a concept are called class/concept descriptions.
Data characterization: a summarization of the general characteristics or features of a target class of data.
Data discrimination: a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes.

Frequent patterns, associations, correlations
Patterns that occur frequently in data:
Itemsets: a frequent itemset typically refers to a set of items that frequently appear together.
Subsequences: sequential patterns.
Substructures: different structural forms (graphs, trees, lattices).
Example association rule: buys(X, "computer") ⇒ buys(X, "software") [support = 1%; confidence = 50%]
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts, for future prediction.
The derived model is based on the analysis of a set of training data (data objects whose class label is known).
The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
E.g., classify countries based on climate, or classify cars based on gas mileage.
Prediction: predicts some unknown or missing numerical values; models continuous-valued functions (regression analysis).

Cluster analysis
The class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns.
Principle: maximize intra-class similarity and minimize inter-class similarity.

Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data.
Noise or exception? Rare events can be more interesting than the regularly occurring ones; useful in fraud detection and rare-event analysis.

Trend and evolution analysis (objects whose behavior changes over time)
Trend and deviation analysis: e.g., regression analysis.
Sequential pattern mining: e.g., digital camera → large SD memory card.
Periodicity analysis.
Similarity-based analysis.
Top-10 Data Mining Algorithms
#1: C4.5
#2: k-Means
#3: SVM
#4: Apriori
#5: EM (Expectation Maximization)
#6: PageRank
#7: AdaBoost
#8: kNN
#9: Naive Bayes
#10: CART

Are All the "Discovered" Patterns Interesting?
Data mining may generate thousands of patterns: not all of them are interesting.
What makes a pattern interesting? A pattern is interesting if it is:
1) easily understood by humans,
2) valid on new or test data with some degree of certainty,
3) potentially useful, and
4) novel.
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
Find all the interesting patterns: completeness. It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns.
Can a data mining system find only the interesting patterns? Searching for only interesting patterns is an optimization problem. Approaches: first generate all the patterns and then filter out the uninteresting ones; or generate only the interesting patterns (mining query optimization).

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the intersection of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.]
Multi-Dimensional View of Data Mining
Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW.
Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Data Mining Task Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. A data mining query is defined in terms of data mining task primitives.

An example using DMQL
Major Issues in Data Mining
Mining methodology and user-interaction issues:
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem
Performance issues:
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Issues relating to the diversity of database types:
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems
Why Is Data Dirty?
Incomplete data may come from:
"Not applicable" data values when collected (e.g., customer information for a sales transaction)
Different considerations between the time the data was collected and the time it is analyzed
Data not considered important at the time of entry
Human/hardware/software problems
Noisy data (incorrect values) may come from:
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission, limited buffer size
Inconsistent data may come from:
Different data sources
Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.

Why Is Data Preprocessing Important?
No quality data, no quality mining results! Duplicate or missing data may cause incorrect or even misleading statistics.
Quality decisions must be based on quality data.
A data warehouse needs consistent integration of quality data.
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
Descriptive Data Summarization: Measuring the Central Tendency
Descriptive data summarization techniques are used to identify the typical properties of the data and highlight which data values should be treated as noise or outliers.
Mean (algebraic measure; sample vs. population):
  x̄ = (1/n) Σ x_i    (population mean: μ = (Σ x_i) / N)
Weighted arithmetic mean:
  x̄ = (Σ w_i x_i) / (Σ w_i)
Trimmed mean: obtained by chopping off the extreme values before averaging.
Measuring the Dispersion of Data
Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum.
Variance and standard deviation: the standard deviation s (or σ) is the square root of the variance s² (or σ²). For the population:
  σ² = (1/N) Σ (x_i − μ)² = (1/N) [ Σ x_i² − (1/N)(Σ x_i)² ]

Boxplot analysis:
Data is represented with a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
The median is marked by a line within the box.
Whiskers: two lines outside the box extend to Minimum and Maximum.

Other graphic displays of basic statistical descriptions: histograms, quantile plots, q-q plots, scatter plots, and loess curves.
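The two equivalent forms of the variance formula above can be checked directly in code. A minimal sketch (the data values here are made up for illustration):

```python
import math

def dispersion(values):
    """Population variance computed two ways, matching the two forms above."""
    n = len(values)
    mu = sum(values) / n
    # Direct form: mean of squared deviations from the mean
    var_direct = sum((x - mu) ** 2 for x in values) / n
    # Shortcut form: mean of squares minus (1/N)(sum)^2, all divided by N
    var_shortcut = (sum(x * x for x in values) - sum(values) ** 2 / n) / n
    return mu, var_direct, var_shortcut

mu, v1, v2 = dispersion([4, 8, 15, 16, 23, 42])
assert abs(v1 - v2) < 1e-9        # both forms agree
sigma = math.sqrt(v1)             # standard deviation is the square root
```

The shortcut form is useful in practice because it can be computed in a single pass over the data.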
Histogram Analysis
A graphical display of basic statistical class descriptions.
An attribute A partitions its data distribution into disjoint subsets, or buckets.
A histogram consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.

Data Mining: Concepts and Techniques — February 13, 2012, GRIET
Scatter Plot
Provides a first look at bivariate data, to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.

Data Cleaning: Missing Data
Data cleaning also resolves redundancy caused by data integration.
Data is not always available; missing (incomplete) data may arise because:
data was not entered due to misunderstanding
certain data may not have been considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
a global constant: e.g., "unknown" (which risks becoming a new class!)
the attribute mean
the attribute mean for all samples belonging to the same class (smarter)
the most probable value: inference-based, such as Bayesian inference

How to Handle Noisy Data?
Noise: random error or variance in a measured variable.
Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means or by bin boundaries.
Regression: smooth by fitting the data to regression functions (linear regression and multiple linear regression).
Clustering: detect and remove outliers.
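The binning step above can be sketched in a few lines. A minimal example of equal-frequency binning with smoothing by bin means; the data values here are illustrative, not from the text:

```python
def smooth_by_bin_means(values, depth):
    """Sort, split into equal-frequency bins of size `depth`,
    then replace every value by the mean of its bin."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value by the closer of its bin's minimum and maximum.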
Cluster Analysis
[Figure: clustered data points; values falling outside the clusters may be treated as outliers.]

Data Cleaning as a Process
The first step in data cleaning as a process is data discrepancy detection.
Data Integration
Data integration combines data from multiple sources into a coherent data store.
Issues to consider:
Schema integration and object matching: how can entities from multiple data sources be matched up? This is the entity identification problem, e.g., A.cust-id vs. B.cust-#. The solution is to integrate metadata from the different sources.
Detecting and resolving data value conflicts: possible reasons are different representations or different scales, e.g., metric vs. British units; price in dollars vs. euros; weight in kg vs. grams; total sales; etc.

Handling Redundancy in Data Integration
Redundant data occur often when multiple databases are integrated:
Object identification: the same attribute or object may have different names in different databases.
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Chi-Square (χ²) Calculation: An Example

                          Male        Female       Sum (row)
Like science fiction      250 (90)    200 (360)    450
Not like science fiction   50 (210)  1000 (840)   1050
Sum (col.)                300        1200         1500

The numbers in parentheses are the expected counts, calculated from the data distribution in the two categories.

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from a table of upper percentage points of the χ² distribution, available in any statistics textbook). Since 507.93 far exceeds 10.828, we can conclude that the two attributes, gender and preferred reading, are strongly correlated in the given group.
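The calculation above is easy to verify in code. A minimal sketch of the Pearson χ² statistic over the four cells of the table:

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum over cells of (o - e)^2 / e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Expected count for one cell = row total * column total / grand total
e_like_male = 450 * 300 / 1500          # 90.0, matching the table

# Observed vs. expected counts for the four cells of the 2x2 table
chi2 = chi_square([250, 50, 200, 1000], [90, 210, 360, 840])
print(round(chi2, 1))                    # 507.9, far above the 10.828 cutoff
```

The slide's 507.93 is this value to two decimal places.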
Data Transformation
Generalization: low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
Normalization: attribute values are scaled to fall within a small, specified range:
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction: new attributes are constructed from the given ones.

Min-Max Normalization: Example
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Then $73,600 is mapped to

  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
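The min-max formula above can be wrapped in a small helper and checked against the worked example:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale v from
    [old_min, old_max] into [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716, as in the example
```

Note that min-max normalization will produce out-of-range results if a future value falls outside the original [min, max].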
Z-Score Normalization
The values of an attribute A are normalized based on the mean and standard deviation of A (μ_A: mean, σ_A: standard deviation):

  v' = (v − μ_A) / σ_A

Example: suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to

  (73,600 − 54,000) / 16,000 = 1.225

Normalization by Decimal Scaling
Normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Example: suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
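Both normalizations above are one-liners in code; a minimal sketch reproducing the two worked examples:

```python
def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

def decimal_scale(v, max_abs):
    """Decimal scaling: divide by the smallest power of 10
    that brings the maximum absolute value below 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scale(-986, 986))          # -0.986 (j = 3)
```

Unlike min-max normalization, z-score normalization is useful when the actual minimum and maximum of the attribute are unknown or dominated by outliers.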
Attribute Subset Selection (Feature Selection)
Heuristic methods are used to select a subset of attributes, removing redundant features (which duplicate information contained in other attributes) and irrelevant features (which contain no information useful for the mining task at hand).
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}; PlayTennis = {yes, no}.
[Figure: decision tree. The root tests Outlook: sunny → test Humidity (high → no, normal → yes); overcast → yes; rain → test Wind (strong → no, weak → yes).]

Dimensionality Reduction: Wavelet Transformation (WT) and PCA
Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis.
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method (hierarchical pyramid algorithm):
The length L of the input must be an integer power of 2 (pad with 0s when necessary).
Each transform step applies two functions: smoothing and difference.
The functions are applied to pairs of data points, resulting in two data sets of length L/2.
The two functions are applied recursively until the desired length is reached.
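The pyramid algorithm above can be sketched with the simplest wavelet, the Haar wavelet, where smoothing is the pairwise average and difference is half the pairwise difference. This choice of filter coefficients is an assumption for illustration; other wavelets (e.g., Daubechies-4) use different coefficients:

```python
def haar_dwt(x):
    """Hierarchical pyramid algorithm with the Haar wavelet.
    len(x) must be a power of 2 (pad with zeros otherwise)."""
    coeffs = []
    while len(x) > 1:
        avgs  = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]  # smoothing
        diffs = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]  # difference
        coeffs = diffs + coeffs   # detail coefficients, coarsest level first
        x = avgs                  # recurse on the smoothed half (length halves)
    return x + coeffs             # overall average, then the details

print(haar_dwt([4, 6, 10, 12]))   # [8.0, -3.0, -1.0, -1.0]
```

Compression comes from keeping only the few coefficients with the largest magnitude and setting the rest to zero before inverting the transform.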
Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b. The two regression coefficients, w and b, specify the line and are estimated using the data at hand.
Multiple regression: Y = b0 + b1·X1 + b2·X2. Many nonlinear functions can be transformed into this form.
Log-linear models:
Approximate discrete multidimensional probability distributions.
Can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes.
Allow a higher-dimensional data space to be constructed from lower-dimensional spaces.

Histograms
Divide the data into buckets and store the average (or sum) for each bucket.
Partitioning rules:
Equal-width: equal bucket range (e.g., equal age ranges).
Equal-frequency (or equal-depth).
V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents).
MaxDiff: set bucket boundaries between the β − 1 pairs of adjacent values having the largest differences, where β is the user-specified number of buckets.
Clustering
Partition the data set into clusters based on similarity, and store the cluster representations only.
The "quality" of a cluster may be represented by its diameter and centroid distance.
In database systems, multidimensional index trees are primarily used for providing fast data access. An index tree recursively partitions the multidimensional space for a given set of data objects. If each child of a parent node is regarded as a bucket, an index tree can be considered a hierarchical histogram.

Sampling
Sampling: obtaining a small sample s to represent the whole data set D containing N tuples.
Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N tuples from D (s < N).
Simple random sample with replacement (SRSWR) of size s: after a tuple is drawn, it is placed back in D so that it may be drawn again.
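Both sampling schemes map directly onto the standard library. A minimal sketch over a toy "database" of integers:

```python
import random

def srswor(data, s):
    """Simple random sample WITHOUT replacement: each tuple drawn at most once."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return [random.choice(data) for _ in range(s)]

D = list(range(100))         # toy data set of N = 100 tuples
a = srswor(D, 10)
b = srswr(D, 10)
assert len(set(a)) == 10     # without replacement: no duplicates possible
```

With replacement, duplicates in the sample are possible (and increasingly likely as s approaches N).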
Cluster sample: apply SRSWOR to the pages (clusters) of tuples, resulting in a cluster sample of the tuples.
Stratified sample: a method of sampling from a population in which the members of the population are divided into homogeneous subgroups (strata) before sampling.

Discretization and Concept Hierarchy
Discretization:
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Supervised vs. unsupervised.
Split (top-down) vs. merge (bottom-up).
Discretization can be performed recursively on an attribute.
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts with higher-level concepts.
Discretization Methods for Numerical Data
Binning: top-down split, unsupervised.
Histogram analysis: top-down split, unsupervised.
Clustering analysis: either top-down split or bottom-up merge, unsupervised.
Entropy-based discretization: supervised, top-down split.
Interval merging by χ² analysis: supervised, bottom-up merge.
Segmentation by natural partitioning: top-down split, unsupervised.

Entropy-Based Discretization
If a set of samples S is partitioned into two intervals S1 and S2 using boundary T, the information after partitioning is

  I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

  Entropy(S1) = − Σ_{i=1..m} p_i log₂(p_i)

where p_i is the probability of class i in S1.
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is recursively applied to the partitions obtained until some stopping criterion is met. Such a boundary may reduce data size and improve classification accuracy.
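The boundary-selection step above can be sketched directly from the formulas. A minimal example on toy labeled data (the values and labels are made up for illustration):

```python
import math

def entropy(labels):
    """Class-distribution entropy: -sum p_i * log2(p_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def split_info(values, labels, boundary):
    """I(S, T): weighted entropy of the two intervals induced by `boundary`."""
    left  = [l for v, l in zip(values, labels) if v < boundary]
    right = [l for v, l in zip(values, labels) if v >= boundary]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_boundary(values, labels):
    """Choose the boundary minimizing I(S, T), scanning midpoints of adjacent values."""
    cands = sorted(set(values))
    mids = [(a + b) / 2 for a, b in zip(cands, cands[1:])]
    return min(mids, key=lambda t: split_info(values, labels, t))

vals = [10, 20, 25, 40, 50, 60]
labs = ['a', 'a', 'a', 'b', 'b', 'b']
print(best_boundary(vals, labs))   # 32.5 — a perfect split, I(S, T) = 0
```

The full method would then recurse on each resulting interval until a stopping criterion (e.g., an MDL-based test) is met.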
Interval Merge by χ² Analysis
Merging-based (bottom-up): find the best neighboring intervals and merge them to form larger intervals, recursively.
ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]:
Initially, each distinct value of a numerical attribute A is considered to be one interval.
χ² tests are performed for every pair of adjacent intervals.
Adjacent intervals with the lowest χ² values are merged together, since low χ² values for a pair indicate similar class distributions.
This merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, max-interval, or max-inconsistency threshold).

Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
Example (profit data, Step 4 onward): the range (−$400 – $5,000) is first partitioned into
  (−$400 – 0), (0 – $1,000), ($1,000 – $2,000), ($2,000 – $5,000)
and each sub-interval is then recursively partitioned by the same rule:
  (−$400 – 0): (−$400 – −$300), (−$300 – −$200), (−$200 – −$100), (−$100 – 0)
  (0 – $1,000): (0 – $200), ($200 – $400), ($400 – $600), ($600 – $800), ($800 – $1,000)
  ($1,000 – $2,000): ($1,000 – $1,200), ($1,200 – $1,400), ($1,400 – $1,600), ($1,600 – $1,800), ($1,800 – $2,000)
  ($2,000 – $5,000): ($2,000 – $3,000), ($3,000 – $4,000), ($4,000 – $5,000)

Concept Hierarchy Generation for Categorical Data
Specification of a hierarchy for a set of values by explicit data grouping: e.g., {Urbana, Champaign, Chicago} < Illinois.
Specification of only a partial set of attributes: the user may have only a vague idea about what should be included in a hierarchy, e.g., specifying only street < city and not the other levels.
Automatic Concept Hierarchy Generation
Specification of a set of attributes, but not of their partial ordering: the hierarchy can be generated automatically based on the number of distinct values per attribute, since a high concept level will usually contain a smaller number of distinct values than an attribute defining a lower concept level.
E.g., for the set of attributes {street, city, state, country}, this yields country < state < city < street.
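This distinct-value heuristic can be sketched in a few lines; the toy table below is hypothetical data for illustration:

```python
def auto_hierarchy(table):
    """Order attributes top-down by ascending number of distinct values:
    fewer distinct values -> higher concept level."""
    counts = {attr: len(set(col)) for attr, col in table.items()}
    return sorted(counts, key=counts.get)

table = {
    "street":  ["Oak St", "Elm St", "Main St", "2nd Ave", "Pine St", "Lake Rd"],
    "city":    ["Urbana", "Urbana", "Chicago", "Chicago", "Austin", "Dallas"],
    "state":   ["IL", "IL", "IL", "IL", "TX", "TX"],
    "country": ["USA", "USA", "USA", "USA", "USA", "USA"],
}
print(" < ".join(auto_hierarchy(table)))   # country < state < city < street
```

The heuristic can fail (e.g., there are fewer distinct weekdays than months), so the result should be treated as a suggestion for the user to confirm.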
Finding the Median, Quartiles and Inter-Quartile Range

Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Example 2: Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
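A sketch of how Example 1 can be checked in code, using the median-of-halves quartile convention (other quartile conventions give slightly different values):

```python
def median(sorted_vals):
    """Median of an already-sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def five_numbers(values):
    """Minimum, Q1, median, Q3, maximum via the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])                  # lower half
    q3 = median(data[half + len(data) % 2:])  # upper half (skip middle if n odd)
    return data[0], q1, median(data), q3, data[-1]

print(five_numbers([12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]))
# (4, 5.5, 8.0, 9.0, 12) — so IQR = Q3 - Q1 = 3.5
```

A common boxplot rule then flags values more than 1.5 × IQR beyond Q1 or Q3 as outliers.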
Using the data for age, answer the following:
(a) Use smoothing by bin means to smooth the data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?

Use the two methods below to normalize the following group of data: 200, 300, 400, 600, 1000
(a) min-max normalization, by setting min = 0 and max = 1
(b) z-score normalization
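A sketch of the normalization exercise in code. Note the z-score part uses the population standard deviation; this is an assumption — using the sample standard deviation (dividing by n − 1) would give different numbers:

```python
import math

data = [200, 300, 400, 600, 1000]

# (a) min-max normalization to [0, 1]
lo, hi = min(data), max(data)
minmax = [(v - lo) / (hi - lo) for v in data]
print(minmax)                           # [0.0, 0.125, 0.25, 0.5, 1.0]

# (b) z-score normalization (population mean and standard deviation assumed)
mean = sum(data) / len(data)            # 500.0
std = math.sqrt(sum((v - mean) ** 2 for v in data) / len(data))  # ~282.84
zscores = [(v - mean) / std for v in data]
print([round(z, 3) for z in zscores])   # [-1.061, -0.707, -0.354, 0.354, 1.768]
```

Either way, the transformed values preserve the ordering of the originals; only the scale and origin change.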