Module 3


Module 3

Data Mining:
Concepts and Techniques

3
Introduction
• Motivation: Why data mining?

• What is data mining?

• Data Mining: On what kind of data?

• Data mining functionality

• Are all the patterns interesting?

• Classification of data mining systems

• Major issues in data mining

• Applications
4
Motivation: “Necessity is the Mother
of Invention”
• Data explosion problem

– Automated data collection tools and mature database technology lead to


tremendous amounts of data stored in databases, data warehouses and
other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: Data warehousing and data mining

– Data warehousing and on-line analytical processing


– Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases 5
6 Business Reasons to Use Data Mining
• Massive quantities of data are available.
• Important information resides in organizations' data warehouses.
• Moves beyond the analysis of past events to predicting future trends and behaviors.
• DM tools can answer complex business questions.
• DM tools can explore intricate interdependencies within databases to discover hidden patterns/relationships.
• Data mining allows decision-makers to make proactive, knowledge-driven decisions.
6
The Information Age is Here!

• "Data doubles about every year, but useful information seems to be decreasing."

• "There is a growing gap between the generation of data and our understanding of it."

• "The trouble with facts is that there are so many of them."

• "Get your facts first, and then you can distort them as much as you please."
7
The Data Flood is Everywhere

• Huge quantities of data are


being generated in all
business, government, and
research domains:
– Banking, retail, marketing,
telecommunications, other
business transactions ...
– Scientific data: genomics,
astronomy, biology, etc.
– Web, text, and e-commerce

8
Evolution of Database Technology
(See Fig. 1.1)

• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.) and application-oriented DBMS (spatial, scientific, engineering,
etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases, and Web
databases 9
What Is Data Mining?

• Data mining (knowledge discovery in databases):


– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases

• Alternative names and their “inside stories”:


– Data mining: a misnomer?
– Knowledge discovery(mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
10
What Is NOT Data Mining?

– (Deductive) query processing.

– Expert systems

– Small ML/statistical programs

11
So what is it?

Data Mining is “an information


extraction activity whose goal is to
discover hidden facts contained in
large databases.”

13
14
Why Data Mining? — Potential Applications

• Database analysis and decision support

– Market analysis and management


• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation

– Risk analysis and management


• Forecasting, customer retention, improved underwriting, quality control,
competitive analysis

– Fraud detection and management


• Other Applications

– Text mining (news group, email, documents) and Web


analysis.
– Intelligent query answering
16
Market Analysis and Management (1)

• Where are the data sources for analysis?


– Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
– Conversion of single to a joint bank account: marriage, etc.
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information

17
Market Analysis and Management (2)

• Customer profiling
– data mining can tell you what types of customers buy what products
(clustering or classification)

• Identifying customer requirements


– identifying the best products for different customers
– use prediction to find what factors will attract new customers

• Provides summary information


– various multidimensional summary reports
– statistical summary information (data central tendency and variation)
18
Corporate Analysis and Risk
Management

• Finance planning and asset evaluation


– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis,
etc.)
• Resource planning:
– summarize and compare the resources and spending
• Competition:
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market 19
Fraud Detection and Management (1)

• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior and use data
mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage accidents to collect
on insurance
– money laundering: detect suspicious money transactions (US Treasury's
Financial Crimes Enforcement Network)
– medical insurance: detect professional patients and ring of doctors and
ring of references 20
Fraud Detection and Management (2)

• Detecting inappropriate medical treatment


– Australian Health Insurance Commission identifies that in many cases
blanket screening tests were requested (save Australian $1m/yr).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers with frequent
intra-group calls, especially mobile phones, and broke a multimillion
dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to dishonest
employees.
21
Other Applications

• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York
Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preference and behavior
pages, analyzing effectiveness of Web marketing, improving Web site
organization, etc.
22
Data Mining: A KDD Process

• Data mining is the core of the knowledge discovery process.
• [Figure: the KDD process as a pipeline: Databases → Data Integration → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
24
Steps of a KDD Process

• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction
• Choosing the mining function: summarization, classification, regression, association, clustering
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundancy
• Use of discovered knowledge
25


Data Mining and Business Intelligence

• [Figure: pyramid with increasing potential to support business decisions toward the top]
  – Data Sources: paper, files, information providers, database systems, OLTP
  – Data Warehouses / Data Marts: OLAP, MDA (DBA)
  – Data Exploration: statistical analysis, querying and reporting (Data Analyst)
  – Data Mining: information discovery (Data Analyst)
  – Data Presentation: visualization techniques (Business Analyst)
  – Making Decisions (End User)
26
Architecture of a Typical Data Mining System

• [Figure: layered architecture]
  – Graphical user interface
  – Pattern evaluation
  – Data mining engine (both coupled to a knowledge base)
  – Database or data warehouse server
  – Data cleaning, data integration, and filtering
  – Databases and data warehouse
27
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW 28
Data Mining Functionalities
• Data mining functionalities specify the kinds of patterns or trends to be found in data mining tasks.
• Data mining is used extensively in many areas or sectors, for example to predict and to characterize data. The purpose of the data mining functionalities is to describe the kinds of patterns that can be mined:
1. Class or concept description
2. Mining of frequent patterns
3. Mining of associations
4. Mining of correlations
5. Mining of clusters
Data Mining Functionalities

• Concept description: Characterization and


discrimination

– Generalize, summarize, and contrast data


characteristics, e.g., dry vs. wet regions

31
Data Mining Functionalities

• Association (correlation and causality)

  – Multi-dimensional vs. single-dimensional association
  – age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 1%, confidence = 50%]
  – contains(T, "computer") => contains(T, "software") [1%, 75%]
32
Data Mining Functionalities

• Classification and Prediction

– Finding models (functions) that describe and distinguish classes or


concepts for future prediction
– E.g., classify countries based on climate, or classify cars based on gas
mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values

33
Data Mining Functionalities

• Cluster analysis

– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns

– Clustering based on the principle: maximizing the intra-class similarity


and minimizing the interclass similarity

34
Data Mining Functionalities

• Outlier analysis

– Outlier: a data object that does not comply with


the general behavior of the data

– It can be considered as noise or exception but is


quite useful in fraud detection, rare events analysis
35
Data Mining Functionalities

• Trend and evolution analysis


– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis

• Other pattern-directed or statistical analyses

36
Are All the “Discovered” Patterns
Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are
interesting.

– Suggested approach: Human-centered, query-based,


focused mining
• Interestingness measures: A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:

– Objective: based on statistics and structures of patterns,


e.g., support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
37
Can We Find All and Only Interesting
Patterns?

• Find all the interesting patterns: Completeness


– Can a data mining system find all the interesting patterns?
– Association vs. classification vs. clustering

• Search for only interesting patterns: Optimization


– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns—mining query optimization
38
Data Mining: Confluence of Multiple Disciplines

• [Figure: data mining at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
39
Data Mining Development

• [Figure: timeline of techniques contributing to data mining]: similarity measures, hierarchical clustering, IR systems, relational data model, imprecise queries, SQL, association rule algorithms, textual data, web search engines, data warehousing, scalability techniques, Bayes theorem, regression analysis, EM algorithm, K-means clustering, algorithm design techniques, time series analysis, algorithm analysis, data structures, neural networks, decision tree algorithms
40
Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

42
Data Mining Models and Tasks

43
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases

– Interactive mining of knowledge at multiple levels of abstraction


– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods 44
Major Issues in Data Mining (2)

• Issues relating to the diversity of data types


– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
– Protection of data security, integrity, and privacy
45
Data Explorations: Getting to Know Your
Data
• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Data Visualization

46
Types of Data Sets
• Record
  – Relational records
  – Data matrix, e.g., numerical matrix, crosstabs
  – Document data: text documents represented as term-frequency vectors, e.g.:

      Document     team  coach  play  ball  score  game  win  lost  timeout  season
      Document 1     3     0      5     0     2      6    0     2      0        2
      Document 2     0     7      0     2     1      0    0     3      0        0
      Document 3     0     1      0     0     1      2    2     0      3        0

  – Transaction data, e.g.:

      TID  Items
      1    Bread, Coke, Milk
      2    Beer, Bread
      3    Beer, Coke, Diaper, Milk
      4    Beer, Bread, Diaper, Milk
      5    Coke, Diaper, Milk

• Graph and network
  – World Wide Web
  – Social or information networks
  – Molecular structures
• Ordered
  – Video data: sequence of images
  – Temporal data: time-series
  – Sequential data: transaction sequences
  – Genetic sequence data
• Spatial, image and multimedia:
  – Spatial data: maps
  – Image data
47
Important Characteristics of Structured Data

• Dimensionality
– Curse of dimensionality
• Sparsity
– Only presence counts
• Resolution
– Patterns depend on the scale
• Distribution
– Centrality and dispersion
48
Data Objects

• Data sets are made up of data objects.


• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples , examples, instances, data points,
objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns ->attributes.
49
Attributes
• Attribute (or dimensions, features, variables): a
data field, representing a characteristic or feature
of a data object.
– E.g., customer _ID, name, address
• Types:
– Nominal
– Binary
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
50
Attribute Types
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings

52
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
  • Measured on a scale of equal-sized units
  • Values have order and can be positive or negative
    – E.g., temperature in C˚ or F˚, calendar dates
  • Ordering allows us to compare and quantify the difference between values, but there is no true zero-point
• Ratio
  • Inherent zero-point
  • We can speak of values as being a multiple of another value
  • Values are ordered
    – e.g., temperature in Kelvin, length, counts, monetary quantities
53
summary of data types and scale measures

54
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented
using a finite number of digits
– Continuous attributes are typically represented as floating-point
variables
56
Basic Statistical Descriptions of Data
• Motivation
  • Used to identify properties of the data and to highlight which data values should be treated as noise or outliers
1. Measures of central tendency, which locate the middle or center of a data distribution: the mean, median, mode, and midrange
2. Measures of dispersion of the data, that is, how the data are spread out. The most common dispersion measures are the range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data

57
3. Graphic displays of basic statistical descriptions, for visual inspection. Most statistical or graphical data presentation software packages include bar charts, pie charts, and line graphs. Other popular displays of data summaries and distributions include quantile plots, quantile–quantile plots, histograms, and scatter plots

58
Measuring the Central Tendency
• Mean (algebraic measure): the most common and effective numeric measure of the "center" of a set of data is the (arithmetic) mean
• Let x1, x2, …, xN be a set of N values or observations for some numeric attribute X, such as salary. The mean of this set of values is

  mean = (x1 + x2 + … + xN) / N = (1/N) Σ xi
59
• problem with the mean is its sensitivity to extreme (e.g., outlier)
values
• Even a small number of extreme values can corrupt the mean. For
example, the mean salary at a company may be substantially pushed
up by that of a few highly paid managers
• To offset the effect we can instead use the trimmed mean, which is
the mean obtained after chopping off values at the high and low
extremes
• For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean. We should
avoid trimming too large a portion (such as 20%) at both ends, as
this can result in the loss of valuable information

60
• Median: for skewed (asymmetric) data
  – the middle value in a set of ordered data values; it is the value that separates the higher half of a data set from the lower half
  – the middle value if there is an odd number of values, or the average of the middle two values otherwise
  – the median is expensive to compute exactly when we have a large number of observations, but we can easily approximate it by interpolating within the median interval of a grouped frequency distribution:

    median ≈ L1 + ((N/2 − (Σ freq)l) / freq_median) × width

  • where L1 is the lower boundary of the median interval, N is the number of values in the entire data set,
  • (Σ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and
  • width is the width of the median interval

61
• Mode
– The mode for a set of data is the value that occurs most frequently in the
set
– it can be determined for qualitative and quantitative attributes.
– Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal
– In general, a data set with two or more modes is multimodal
– At the other extreme, if each data value occurs only once, then there is
no mode.
• Midrange
– The midrange can also be used to assess the central tendency of a
numeric data set.
– It is the average of the largest and smallest values in the set.
62
• Mean. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. The mean is $58,000.
• Median. For the same example, the two middlemost values are 52 and 56 (the sixth and seventh values in the list). By convention, we assign the average of the two middlemost values as the median; that is, (52 + 56)/2 = 54. Thus, the median is $54,000. Suppose that we had only the first 11 values in the list. Then the median is the middlemost value, the sixth value in this list, which is $52,000.
• Mode. The data from the above example are bimodal. The two modes are $52,000 and $70,000.
• Midrange. It is the average of the largest and smallest values in the set: (30,000 + 110,000)/2 = $70,000.
• (A quick Python check of these values is sketched below.)
63
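These figures can be verified with a few lines of Python; the following is a minimal sketch using only the standard library (the variable names are illustrative, not from the slides):

```python
from statistics import mean, median, multimode

# Salary values (in thousands of dollars), already sorted
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean    :", mean(salaries))                        # 58 -> $58,000
print("median  :", median(salaries))                      # (52 + 56) / 2 = 54 -> $54,000
print("modes   :", multimode(salaries))                   # [52, 70] -> bimodal
print("midrange:", (min(salaries) + max(salaries)) / 2)   # (30 + 110) / 2 = 70 -> $70,000
```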
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed, and negatively skewed data
• [Figure: symmetric, positively skewed, and negatively skewed distributions]

64
Measuring the Dispersion of Data
• Range: of the set is the difference between the largest (max()) and smallest (min())
values.
• Quantiles: are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
• The 4-quantiles are the three data points that split the data distribution into four equal
parts; each part represents one-fourth of the data distribution. They are more commonly
referred to as quartiles

65
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– The 100-quantiles are more commonly referred to as percentiles; they
divide the data distribution into 100 equal-sized consecutive sets

• Inter-quartile range: the distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data
  – IQR = Q3 – Q1
• For the same salary example, the quartiles are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 = $63,000
• The interquartile range is IQR = 63 − 47 = $16,000

66
Five-Number Summary, Boxplots, and Outliers
• Five number summary: of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually

• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

• Variance and standard deviation:

  – The variance of N observations, x1, x2, …, xN, for a numeric attribute X is

    σ2 = (1/N) Σ (xi − x̄)2

  – Standard deviation σ (or s) is the square root of the variance σ2 (or s2)
  – A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values


67
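As a rough companion to the dispersion measures above, here is a small numpy sketch on the same salary data. It follows the slide's "third and ninth sorted values" rule for Q1 and Q3 rather than numpy's own percentile interpolation, so the exact choices shown are illustrative assumptions:

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Quartiles per the slide's rule: the 3rd and 9th values of the sorted list
q1, q3 = salaries[2], salaries[8]          # 47 and 63
iqr = q3 - q1                              # 63 - 47 = 16 -> $16,000
five_number = (salaries.min(), q1, np.median(salaries), q3, salaries.max())
print("five-number summary:", five_number)
print("IQR:", iqr)

# The usual boxplot outlier rule: more than 1.5 * IQR beyond the quartiles
outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]
print("outliers:", outliers)               # [110]

# Variance and standard deviation (dividing by N, as in the slide's formula)
print("variance:", salaries.var())
print("std dev :", salaries.std())
```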
Graphic Displays of Basic Statistical Descriptions

• These include
– quantile plots,
– quantile–quantile plots,
– histograms,
– scatter plots.
• Such graphs are helpful for the visual inspection of data,
which is useful for data preprocessing
• The first three of these show univariate distributions (i.e.,
data for one attribute)
• while scatter plots show bivariate distributions (i.e.,
involving two attributes)

68
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that
approximately 100·fi% of the data are below or equal to the
value xi

69
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there is a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

70
71
Histogram Analysis

• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X
• If X is nominal, such as automobile model or item type, then a pole or vertical bar is drawn for
each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value
• The resulting graph is more commonly known as a bar chart
• If X is numeric, the term histogram is preferred. The range of values for X is partitioned into
disjoint consecutive subranges
• subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X. The
range of a bucket is known as the width. Typically, the buckets are of equal width.
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted
as points in the plane

72
Positively and Negatively Correlated Data

• The left half fragment is positively


correlated

• The right half is negatively correlated

73
74

Uncorrelated Data
• [Figure: scatter plots of uncorrelated data]
77
Data Visualization
• Why data visualization?
• Data visualization is the graphical representation of information and data.
– Gain insight into an information space by mapping data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships among data
– Help find interesting regions and suitable parameters for further quantitative analysis
– Provide a visual proof of computer representations derived
• Categorization of visualization methods:
– Pixel-oriented visualization techniques
– Geometric projection visualization techniques
– Icon-based visualization techniques
– Hierarchical visualization techniques
– Visualizing complex data and relations

78
Pixel-Oriented Visualization Techniques
• For a data set of m dimensions, create m windows on the screen, one for each
dimension
• The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
• The colors of the pixels reflect the corresponding values

• [Figure: pixel-oriented visualization windows for four dimensions: (a) income, (b) credit limit, (c) transaction volume, (d) age]
79
Laying Out Pixels in Circle Segments
• To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment

• [Figure: (a) representing a data record in a circle segment; (b) laying out pixels in circle segments]
80
Geometric Projection Visualization Techniques
• A drawback of pixel-oriented visualization techniques is that they cannot help
us much in understanding the distribution of data in a multidimensional space
• Geometric projection techniques help users find interesting projections of
multidimensional data sets

81
• A 3-D scatter plot uses three axes in a Cartesian coordinate
system. If it also uses color, it can display up to 4-D data
points

82
• The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n dimensional data set, a scatter-plot matrix is an nxn grid of 2-D scatter
plots that provides a visualization of each dimension with every other dimension

83
• The scatter-plot matrix becomes less effective as the dimensionality increases.
• Another popular technique, called parallel coordinates, can handle higher dimensionality.
• To visualize n-dimensional data points, the parallel coordinates technique draws n
equally spaced axes, one for each dimension, parallel to one of the display axes

84
Icon-Based Visualization Techniques
• Visualization of the data values as features of icons
• Typical visualization methods
– Chernoff Faces
– Stick Figures

85
Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow
slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics--head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening

• Chernoff faces make use of the ability of the


human mind to recognize small differences in
facial characteristics and to assimilate many facial
characteristics at once

86
Stick Figure

• Maps multidimensional data to five-piece stick figures

• A 5-piece stick figure (1 body and 4 limbs with different angles/lengths)

• Two dimensions are mapped to the display (x and y) axes, and the remaining dimensions are mapped to the angle and/or length of the limbs

87
• The figure shows census data, where age and income are mapped to the display axes, and the remaining dimensions (gender, education, and so on) are mapped to stick figures.
• If the data items are relatively dense with respect to the two display dimensions, the resulting visualization shows texture patterns, reflecting data trends
88
Hierarchical Visualization Techniques

• These techniques partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner.
• Methods
– Worlds-within-Worlds
– Tree-Map

89
Dimensional Stacking

• Partitioning of the n-dimensional attribute space in 2-D subspaces,


which are ‘stacked’ into each other
• Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
• Adequate for data with ordinal attributes of low cardinality
• But, difficult to display more than nine dimensions
• Important to map dimensions appropriately
90
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute

Visualization of oil mining data with longitude and latitude mapped


to the outer x-, y-axes and ore grade and depth mapped to the
inner x-, y-axes
91
Worlds-within-Worlds
• "Worlds-within-Worlds," also known as n-Vision
• Suppose we want to visualize a 6-D data set, where the dimensions are F,X1, …
,X5
• We want to observe how dimension F changes with respect to the other
Dimensions

92
• first fix the values of dimensions X3,X4,X5 to some selected
values, say, c3, c4, c5.
• We can then visualize F,X1,X2 using a 3-D plot, called a world, as
shown
• The position of the origin of the inner world is located at the
point(c3, c4, c5) in the outer world, which is another 3-D plot using
dimensions X3,X4,X5.
• A user can interactively change, in the outer world, the location of
the origin of the inner world.
• The user then views the resulting changes of the inner world.

93
Tree-Map
• tree-maps display hierarchical data as a set of nested rectangles
• All news stories are organized into seven categories, each shown in a large rectangle of a
unique color.
• Within each category (i.e., each rectangle at the top level), the news stories are further
partitioned into smaller subcategories

Ack.: http://www.cs.umd.edu/hcil/treemap- 94
Tree-Map of a File System (Schneiderman)

95
InfoCube
• A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
• The outermost cubes correspond to the top level
data, while the subnodes or the lower level data are
represented as smaller cubes inside the outermost
cubes, and so on

96
Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
98
Data Quality: Why Preprocess the
Data?
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday? 100
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much the data are trusted to be correct?
– Interpretability: how easily the data can be understood?

101
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
102
103
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of
entry
– not register history or changes of the data
• Missing data may need to be inferred
104
Data Cleaning
• Data cleaning (or data cleansing) routines attempt to
fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data
– Missing Values
– Noisy Data
– Data Cleaning as a Process

105
How to Handle Missing Data?
1. Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
2. Fill in the missing value manually: tedious + infeasible?
3. Fill in it automatically with a global constant : e.g.,
“unknown”, a new class?!
4. Use a measure of central tendency for the attribute to fill in
the missing value
the attribute mean or median, depending on the data distribution
106
5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple
the attribute mean for all samples belonging to the same class:
smarter
6. Use the most probable value to fill in the missing value
the most probable value: inference-based such as Bayesian formula or decision tree
• Methods 3 through 6 bias the data—the filled-in value may not be
correct.
• Method 6, however, is a popular strategy as it uses the most information
from the present data to predict missing values

107
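A minimal pandas sketch of strategies 1, 4 and 5 above; the tiny DataFrame and column names are made up purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# 1. Ignore (drop) tuples with missing values
dropped = df.dropna()

# 4. Fill with a measure of central tendency for the attribute (here the median)
filled_global = df.fillna({"income": df["income"].median()})

# 5. Fill with the mean of all samples belonging to the same class as the tuple
filled_by_class = df.copy()
filled_by_class["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(filled_by_class)
```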
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
108
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.

109
Binning
• Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it
– The sorted values are distributed into a number of “buckets,” or bins. i.e
local smoothing.
– example: the data for price are first sorted and then partitioned into three equal-
frequency bins (i.e., each bin contains four values)

• smoothing by bin means, each


– value in a bin is replaced by the mean value of the bin. For example, the
mean of the values 4, 8,9 and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9
– Similarly, smoothing by bin medians can be employed, in which each bin value
is replaced by the bin median

110
Binning Methods for Data Smoothing

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into equal-frequency (equi-depth) bins:
  – Bin 1: 4, 8, 9, 15
  – Bin 2: 21, 21, 24, 25
  – Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  – Bin 1: 9, 9, 9, 9
  – Bin 2: 23, 23, 23, 23
  – Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
  – Bin 1: 4, 4, 4, 15
  – Bin 2: 21, 21, 25, 25
  – Bin 3: 26, 26, 26, 34
• Binning is also used as a discretization technique
111
Consider the following example of equal-frequency binning.
• Dataset: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60
• First, we have to sort the data. If the data are unsorted, we sort them and then apply the second step. In the second step, we find the bin frequency using the formula: frequency = total number of data points / number of bins.
• In this case, the total number of data points is 12, and the number of bins required is 3. Therefore, the frequency comes out to be 4. Now let's add the values into the bins (a short Python sketch of equal-frequency binning and smoothing follows below).
• BIN 1: 10, 15, 18, 20
• BIN 2: 31, 34, 41, 46
• BIN 3: 51, 53, 54, 60
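The binning and smoothing steps above can be sketched in plain Python; this example reuses the price data from the earlier slide, and the helper expressions are illustrative rather than a fixed recipe:

```python
# Sorted prices from the earlier slide
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

n_bins = 3
size = len(prices) // n_bins             # equal-frequency: 4 values per bin
bins = [prices[i:i + size] for i in range(0, len(prices), size)]
print("bins:", bins)                     # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: replace every value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print("by means:", by_means)             # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value with the closer of min/max of its bin
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
print("by boundaries:", by_bounds)       # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```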
Customer Segmentation
• Binning can be used in customer segmentation. A business can optimize its
marketing strategies using the Binning method in customer segmentation.
■ Regression
■ smooth by fitting the data into regression functions

■ Linear regression involves finding the “best” line to fit two


attributes (or variables) so that one attribute can be used to
predict the other
■ Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface
■ Outlier analysis:
■ Outliers may be detected by clustering

114
Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)

■ Check field overloading

■ Check uniqueness rule, consecutive rule and null rule

■ Use commercial tools

■ Data scrubbing: use simple domain knowledge (e.g., postal

code, spell-check) to detect errors and make corrections


■ Data auditing: analyzing data to discover rules and relationships and to detect violators (e.g., using correlation and clustering to find outliers)
■ Data migration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools
■ allow users to specify transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter's Wheel)

115
Data Preprocessing

■ Data Preprocessing: An Overview

■ Data Quality

■ Major Tasks in Data Preprocessing

■ Data Cleaning

■ Data Integration

■ Data Reduction

■ Data Transformation and Data Discretization

■ Summary

116
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ help improve the accuracy and speed of the subsequent data
mining process.
■ FOLLOWING TECHNIQUES ARE USED

■ Entity identification problem:


■ Redundancy and Correlation Analysis
■ Tuple Duplication
■ Data Value Conflict Detection and Resolution

117
Entity Identification Problem

■ Schema integration and object matching can be


tricky
■ How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity
identification problem
■ Special attention must be paid to the structure of the
data to ensure that any attribute functional dependencies
and referential constraints in the source system match
those in the target system

118
Redundancy and Correlation Analysis
■ Redundant data occur often when integrating multiple databases
  ■ Object identification: the same attribute or object may have different names in different databases
  ■ Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
■ Redundancy can be detected by correlation analysis
  ■ Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data
  ■ For nominal data, we use the Χ2 (chi-square) test
  ■ For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another

119
Correlation Analysis (Nominal Data)

■ Χ2 (chi-square) test: Χ2 = Σ (observed − expected)2 / expected, summed over all cells of the contingency table, where the expected count of a cell is (row total × column total) / n

120
Chi-Square Calculation: An Example

■ Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

■ For this 2×2 table, the degrees of freedom are (2−1)(2−1) = 1

■ For 1 degree of freedom, the Χ2 value needed to reject the hypothesis at the 0.001 significance level is 10.828
■ Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.

122
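For reference, the chi-square test can be run with scipy. The 2×2 counts below are only a hypothetical gender-vs-preferred-reading table, since the slide's actual contingency table (shown as a figure) is not reproduced in this text:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (rows: gender; columns: preferred reading)
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("chi-square:", chi2)        # compare against 10.828 (alpha = 0.001, dof = 1)
print("dof       :", dof)         # (2 - 1)(2 - 1) = 1
print("expected  :\n", expected)  # counts expected if the attributes were independent
```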
Correlation Analysis (Numeric Data)

■ Correlation coefficient (also called Pearson's product-moment coefficient)

  rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB) = (Σ (ai bi) − n Ā B̄) / (n σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
■ If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
■ rA,B = 0: independent; rA,B < 0: negatively correlated
123
Covariance (Numeric Data)

■ Covariance of two numeric attributes A and B:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (ai − Ā)(bi − B̄)

■ It is related to the correlation coefficient by rA,B = Cov(A, B) / (σA σB)

124
■ Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger than their expected values.
■ Negative covariance: If one of the attributes tends to be above its expected value when the other attribute is below its expected value, then the covariance of A and B is negative.
■ Independence: If A and B are independent, Cov(A, B) = 0, but the converse is not true:
■ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence

126
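A short numpy sketch of the correlation coefficient and covariance defined above; the two attribute vectors are made-up values for illustration:

```python
import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])     # illustrative numeric attribute A
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # illustrative numeric attribute B

n = len(A)
cov = ((A - A.mean()) * (B - B.mean())).sum() / n   # Cov(A, B), dividing by n
r = cov / (A.std() * B.std())                       # Pearson correlation r_{A,B}

print("covariance :", cov)
print("correlation:", r)
print("numpy check:", np.corrcoef(A, B)[0, 1])      # should agree with r
```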
Tuple Duplication

■ In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level
■ Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences

129


Data Value Conflict Detection and
Resolution

■ Data integration also involves the detection and


resolution of data value conflicts
■ For example, for the same real-world entity,
attribute values from different sources may differ.
■ This may be due to differences in representation,
scaling, or encoding
■ Attributes may also differ on the abstraction level,
where an attribute in one system is recorded at,
say, a lower abstraction level than the “same”
attribute in another

130


Data Preprocessing

■ Data Preprocessing: An Overview

■ Data Quality

■ Major Tasks in Data Preprocessing

■ Data Cleaning

■ Data Integration

■ Data Reduction

■ Data Transformation and Data Discretization

■ Summary

131
Data Reduction Strategies
■ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
■ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
■ Data reduction strategies
■ Dimensionality reduction: is the process of reducing number of

random variables or attributes under consideration


■ Attribute subset selection

■ Numerosity reduction: techniques that replace the original data volume
by alternative, smaller forms of data representation


■ Histograms, clustering, sampling
■ Data cube aggregation
■ Data compression

132
Data Reduction 1: Dimensionality
Reduction
■ Curse of dimensionality
■ When dimensionality increases, data becomes increasingly sparse
■ Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
■ The possible combinations of subspaces will grow exponentially
■ Dimensionality reduction
■ Avoid the curse of dimensionality
■ Help eliminate irrelevant features and reduce noise
■ Reduce time and space required in data mining
■ Allow easier visualization
■ Dimensionality reduction techniques
■ Wavelet transforms
■ Principal Component Analysis
■ Supervised and nonlinear techniques (e.g., feature selection)

133
Mapping Data to a New Space
■ Fourier transform
■ Wavelet transform

[Figure: two sine waves; two sine waves + noise; frequency-domain representation]

134
What Is Wavelet Transform?
■ Decomposes a signal into
different frequency subbands
■ Applicable to n-
dimensional signals
■ Data are transformed to
preserve relative distance
between objects at different
levels of resolution
■ Allow natural clusters to
become more distinguishable
■ Used for image compression

135
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]
■ Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
■ Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
■ Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
■ Method:
■ Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
■ Each transform has 2 functions: smoothing, difference
■ Applies to pairs of data, resulting in two sets of data of length L/2
■ Applies two functions recursively, until reaches the desired length
136
Wavelet Decomposition
■ Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
■ S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ =
[2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
■ Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained

137
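The averaging/differencing idea behind this decomposition can be sketched in a few lines of Python. Using unnormalized pairwise averages and half-differences (which is what the slide's numbers imply), the sketch reproduces the transform of S above:

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition: pairwise averages and half-differences,
    applied recursively to the averages until one overall average remains."""
    values = list(values)
    details = []
    while len(values) > 1:
        pairs = list(zip(values[0::2], values[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details  # coarser details first
        values = [(a + b) / 2 for a, b in pairs]
    return values + details   # [overall average, detail coefficients ...]

S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))      # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```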
Haar Wavelet Coefficients

■ Hierarchical decomposition structure (a.k.a. "error tree")
■ [Figure: error tree for the original frequency distribution 2, 2, 0, 2, 3, 5, 4, 4, with the overall average 2.75 at the root, followed by the detail coefficients -1.25; 0.5, 0; and 0, -1, -1, 0, together with the "supports" of each coefficient]
138
Why Wavelet Transform?
■ Use hat-shape filters
■ Emphasize region where points cluster

■ Suppress weaker information in their boundaries

■ Effective removal of outliers


■ Insensitive to noise, insensitive to input order

■ Multi-resolution
■ Detect arbitrary shaped clusters at different scales

■ Efficient
■ Complexity O(N)

■ Only applicable to low dimensional data

139
Principal Component Analysis (PCA)
■ Find a projection that captures the largest amount of variation in data
■ The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space

x2

x1
140
Principal Component Analysis (Steps)
■ Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
■ Normalize input data: Each attribute falls within the same range
■ Compute k orthonormal (unit) vectors, i.e., principal components
■ Each input data (vector) is a linear combination of the k principal
component vectors
■ The principal components are sorted in order of decreasing
“significance” or strength
■ Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
■ Works for numeric data only
141
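The steps above can be sketched directly with numpy's eigendecomposition of the covariance matrix; the generated 2-D data set is arbitrary and only illustrates the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data with correlated attributes, for illustration only
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# 1. Normalize (here: center) the input data
Xc = X - X.mean(axis=0)

# 2. Principal components = eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance ("significance") and keep the strongest k
order = np.argsort(eigvals)[::-1]
k = 1
components = eigvecs[:, order[:k]]

# 4. Project the data onto the reduced k-dimensional space
X_reduced = Xc @ components
print(X_reduced.shape)        # (200, 1)
```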
Attribute Subset Selection
■ way to reduce dimensionality of data
■ Redundant attributes
■ Duplicate much or all of the information contained in
one or more other attributes
■ E.g., purchase price of a product and the amount of
sales tax paid
■ Irrelevant attributes
■ Contain no information that is useful for the data
mining task at hand
■ E.g., students' ID is often irrelevant to the task of
predicting students' GPA

142
Heuristic Search in Attribute Selection
■ There are 2^n possible attribute combinations of n attributes
■ Typical heuristic attribute selection methods:
  1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
  2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

143
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
  ■ Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
  ■ At each node, the algorithm chooses the "best" attribute to partition the data into individual classes. (A sketch of stepwise forward selection follows below.)
144
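Stepwise forward selection (method 1 above) is available in scikit-learn as SequentialFeatureSelector; the sketch below assumes scikit-learn is installed and uses its bundled wine data set purely as an example:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

# Start from an empty reduced set and greedily add the attribute that most
# improves cross-validated accuracy; direction="backward" would instead give
# stepwise backward elimination.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))
```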
Data Reduction 2: Numerosity
Reduction
■ Reduce data volume by choosing alternative, smaller
forms of data representation
■ Parametric methods (e.g., regression)
■ Assume the data fits some model, estimate model

parameters, store only the parameters, and discard


the data (except possible outliers)
■ Ex.: Log-linear models—obtain the value at a point in m-D space as the product over appropriate marginal subspaces
■ Non-parametric methods
■ Do not assume models

■ Major families: histograms, clustering, sampling, …

145
■ In (simple) linear regression, the data are modeled to fit a
straight line. For example, a random variable, y (called a
response variable), can be modeled as a linear function of
another random variable, x (called a predictor variable), with
the equation
■ y = wx + b
■ where the variance of y is assumed to be constant.
■ In the context of data mining, x and y are numeric database
attributes.
■ The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively
■ Multiple linear regression is an extension of (simple) linear regression,
which allows a response variable, y, to be modeled as a linear function of
two or more predictor variables.
146
■ Log-linear models approximate discrete multidimensional probability
distributions.
■ Given a set of tuples in n dimensions (e.g., described by n attributes), we can
consider each tuple as a point in an n-dimensional space.
■ Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes,
■ This allows a higher-dimensional data space to be constructed from lower-
dimensional spaces.
■ Log-linear models are therefore also useful for dimensionality reduction
(since the lower-dimensional points together typically occupy less space than
the original data points) and data smoothing (since aggregate estimates in the
lower-dimensional space are less subject to sampling variations than the
estimates in the higher-dimensional space).

147
Parametric Data Reduction:
Regression and Log-Linear Models
■ Linear regression
■ Data modeled to fit a straight line

■ Often uses the least-square method to fit the line

■ Multiple regression
■ Allows a response variable Y to be modeled as a

linear function of multidimensional feature vector


■ Log-linear model
■ Approximates discrete multidimensional probability

distributions

148
Regression Analysis

■ Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (a.k.a. explanatory variables or predictors)
■ The parameters are estimated so as to give a "best fit" of the data
■ Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
■ Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
■ [Figure: data points, the fitted line y = x + 1, and the fitted value Y1' for X1]
149
Regression Analysis and Log-Linear
Models
■ Linear regression: Y = w X + b
■ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
■ Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
■ Multiple regression: Y = b0 + b1 X1 + b2 X2
■ Many nonlinear functions can be transformed into the above
■ Log-linear models:
■ Approximate discrete multidimensional probability distributions
■ Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
■ Useful for dimensionality reduction and data smoothing
150
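A tiny least-squares sketch of linear regression as a parametric (numerosity) reduction: only the two coefficients are kept and the raw points are reconstructed approximately on demand. The (x, y) pairs are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates of the regression coefficients in y = w*x + b
w, b = np.polyfit(x, y, deg=1)
print(f"fitted model: y = {w:.3f} x + {b:.3f}")

# Store only (w, b) instead of the raw data; reconstruct approximate values as needed
y_hat = w * x + b
print("max reconstruction error:", np.abs(y - y_hat).max())
```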
Histogram Analysis

■ Histograms use binning to approximate data


distributions and are a popular form of data
reduction
■ A histogram for an attribute, A, partitions the data
distribution of A into disjoint subsets, referred to as
buckets or bins
■ If each bucket represents only a single attribute–
value/frequency pair, the buckets are called
singleton buckets

151
■ Histograms Example
■ The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

152
■ Figure shows a histogram for the data using
singleton buckets

153
■ To further reduce the data, it is common to have
each bucket denote a continuous value range for
the given attribute. In Figure , each bucket
represents a different $10 range for price
154
■ "How are the buckets determined and the attribute values partitioned?"
■ There are several partitioning rules, including the following:
  ■ Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the width of $10 for the buckets in Figure 3.8).
  ■ Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).
■ Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data. (A small sketch on the price data above follows.)
155
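A small numpy sketch of an equal-width histogram (buckets of width $10) and equal-frequency cut points for the price list above; the bucket edges are one reasonable choice, not the only one:

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width histogram: each bucket covers a $10 range
counts, edges = np.histogram(prices, bins=[0, 10, 20, 30, 40])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${lo:.0f}-${hi:.0f}: {c}")

# Equal-frequency (equal-depth) boundaries: quartile cut points of the data
print("equal-frequency boundaries:", np.percentile(prices, [0, 25, 50, 75, 100]))
```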
Clustering
■ Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
■ Can be very effective if data is clustered but not if data
is “smeared”
■ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms

156
Sampling

■ Sampling: obtaining a small sample s to represent the


whole data set N
■ Key principle: Choose a representative subset of the data
■ Common ways of sampling:
  ■ Simple random sample without replacement of size s (SRSWOR)
  ■ Simple random sample with replacement of size s (SRSWR)
  ■ Cluster sample
  ■ Stratified sample

157
Types of Sampling

■ Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled
■ Simple random sample with replacement (SRSWR) of size s:
This is similar to SRSWOR, except that each time a tuple is drawn from
D, it is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
■ Cluster sample: If the tuples in D are grouped into M mutually
disjoint “clusters,” then an SRS of s clusters can be obtained, where
s < M.
For example, tuples in a database are usually retrieved a page at a time,
so that each page can be considered a cluster

158
159
Sampling: With or without Replacement
■ [Figure: raw data sampled with replacement (SRSWR) and without replacement (SRSWOR)]
160
❑ Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum.
❑This helps ensure a representative sample, especially when the data are
skewed. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group
❑In this way, the age group having the smallest number of customers will
be sure to be represented
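The four sampling schemes can be sketched in a few lines of Python (pandas is assumed; the customer table, its page size, and the age_group column are illustrative, not from the slides):

import pandas as pd

D = pd.DataFrame({
    "customer_id": range(1, 1001),
    "age_group": ["youth"] * 200 + ["adult"] * 700 + ["senior"] * 100,
})
s = 50

srswor = D.sample(n=s, replace=False)     # SRSWOR: each tuple drawn at most once
srswr = D.sample(n=s, replace=True)       # SRSWR: a tuple may be drawn again

# Cluster sample: treat each block of 100 rows as a "page" and draw whole pages.
D["page"] = D.index // 100
pages = pd.Series(D["page"].unique()).sample(n=2)
cluster_sample = D[D["page"].isin(pages)]

# Stratified sample: draw the same fraction from every age group, so even
# the smallest group is guaranteed to be represented.
stratified = D.groupby("age_group", group_keys=False).sample(frac=0.05)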
Data Cube Aggregation
■ Concept hierarchies may exist for each attribute, allowing the analysis
of data at multiple abstraction levels
■ For example, a hierarchy for branch could allow branches to be
grouped into regions, based on their address
■ Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining
■ The cube created at the lowest abstraction level is referred to as the base cuboid
■ The base cuboid should correspond to an individual entity of interest, such as sales or customer
■ The cube at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would give one total: the total sales for all three years, for all item types, and for all branches

166
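A minimal pandas sketch of cube-style aggregation, from the base cuboid up to the apex cuboid (the column names and numbers are invented for the example, not taken from Figure 3.11):

import pandas as pd

sales = pd.DataFrame({
    "year":      [2022, 2022, 2023, 2023, 2024, 2024],
    "item_type": ["TV", "laptop", "TV", "laptop", "TV", "laptop"],
    "branch":    ["A", "A", "B", "B", "A", "B"],
    "sales":     [568, 750, 512, 890, 601, 925],
})

# Base cuboid: one total per (year, item_type, branch) combination.
base_cuboid = sales.groupby(["year", "item_type", "branch"])["sales"].sum()

# A higher-level cuboid: roll branches up into an all-branch total per year and item.
year_item_cuboid = sales.groupby(["year", "item_type"])["sales"].sum()

# Apex cuboid: a single grand total over all years, item types, and branches.
apex_cuboid = sales["sales"].sum()
print(apex_cuboid)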
Data Reduction : Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms
■ Typically lossless, but only limited manipulation is possible without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement
■ Sometimes small fragments of signal can be reconstructed without reconstructing the whole
■ Time sequence data is not audio
■ Typically short and varies slowly with time
■ Dimensionality and numerosity reduction may also be considered as forms of data compression
168
Data Compression
[Figure: lossless compression transforms the original data into compressed data from which the original can be fully reconstructed; lossy compression recovers only an approximation of the original data]
169
Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
170
Data Transformation
■ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
■ Methods
■ Smoothing: Remove noise from data, techniques include binning,
regression, and clustering
■ Attribute/feature construction
■ New attributes constructed from the given ones
■ Aggregation: Summarization, data cube construction
■ Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
■ Discretization: where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute
171
■ Concept hierarchy generation for nominal data: where attributes
such as street can be generalized to higher-level concepts, like
city or country. Many hierarchies for nominal attributes are
implicit within the database schema and can be automatically
defined at the schema definition level

172
Data Transformation by Normalization

■ In general, expressing an attribute in smaller


units will lead to a larger range for that attribute,
and thus tend to give such an attribute greater
effect or “weight.”
■ To help avoid dependence on the choice of
measurement units, the data should be
normalized or standardized. This involves
transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0]
■ Normalizing the data attempts to give all
attributes an equal weight
173
Data Transformation by Normalization
■ Min-max normalization: to [new_minA, new_maxA]
■ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to 0.716
■ Z-score normalization (μ: mean, σ: standard deviation)
■ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is normalized to 1.225
■ Normalization by decimal scaling
174
Data Transformation by Normalization
■ Let A be a numeric attribute with n observed values v1, v2, ..., vn
■ Min-max normalization: performs a linear transformation on the original data
■ Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value vi of A to vi' in the range [new_minA, new_maxA] by computing
vi' = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
■ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
* 175
■ Z-score normalization (μ: mean, σ: standard deviation): the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A
■ The value vi of A is normalized to vi' by computing
vi' = (vi - μ) / σ
■ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225
■ A variation of this z-score normalization replaces the standard deviation in the above equation by the mean absolute deviation of A, sA = (1/n)(|v1 - mean(A)| + |v2 - mean(A)| + ... + |vn - mean(A)|)
■ Z-score normalization using the mean absolute deviation: vi' = (vi - mean(A)) / sA
* 176
■ Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A
■ The value vi of A is normalized to vi' by computing
vi' = vi / 10^j
■ where j is the smallest integer such that max(|vi'|) < 1
■ Suppose that the recorded values of A range from -986 to 917
■ The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
* Data Mining: Concepts and Techniques 177
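A minimal NumPy sketch of the three normalization methods above; the income figures and the -986/917 range are the ones used in the slide examples, everything else is illustrative:

import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]: 73,600 maps to about 0.716.
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# Z-score normalization with the slide's mu = 54,000 and sigma = 16,000:
# 73,600 maps to 1.225.
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1.
values = np.array([-986.0, 917.0])
j = int(np.floor(np.log10(np.abs(values).max()))) + 1   # j = 3 here
decimal_scaled = values / 10 ** j                        # -0.986 and 0.917

print(minmax, zscore, decimal_scaled, sep="\n")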
Discretization
■ Discretization techniques can be categorized based on how the discretization is performed
■ If the discretization process uses class information, then we say it is supervised discretization; otherwise, it is unsupervised
■ If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting
■ This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals
178
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Binning
■ Top-down split, unsupervised. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively
■ Histogram analysis
■ Top-down split, unsupervised. In an equal-width histogram, for example, the values are partitioned into equal-size partitions or ranges
■ Clustering analysis (unsupervised, top-down split or bottom-up
merge)
■ Decision-tree analysis (supervised, top-down split)

179
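A minimal pandas sketch of discretization by binning, including smoothing by bin means (the age values are illustrative):

import pandas as pd

ages = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width binning into 3 intervals, then smoothing by bin means:
# each raw value is replaced by the mean of its bin.
width_bins = pd.cut(ages, bins=3)
smoothed_by_mean = ages.groupby(width_bins).transform("mean")

# Equal-frequency (equal-depth) binning uses quantiles as the cut points.
depth_bins = pd.qcut(ages, q=3)

print(pd.DataFrame({"age": ages, "width_bin": width_bins,
                    "smoothed": smoothed_by_mean, "depth_bin": depth_bins}))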
Labels (Binning vs. Clustering)
[Figure: the same data discretized into labels by equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results]
180
Discretization by Classification &
Correlation Analysis
■ Classification (e.g., decision tree analysis)
■ Supervised: Given class labels, e.g., cancerous vs. benign
■ Using entropy to determine split point (discretization point)
■ Top-down, recursive split
■ Details to be covered in Chapter 7
■ Correlation analysis (e.g., Chi-merge: χ2-based discretization)
■ Supervised: use class information
■ Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge
■ Merging is performed recursively, until a predefined stopping condition is met

181
Concept Hierarchy Generation

■ Concept hierarchy organizes concepts (i.e., attribute values)


hierarchically and is usually associated with each dimension in a data
warehouse
■ Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple levels of granularity
■ Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
■ Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
■ Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.

182
Concept Hierarchy Generation
for Nominal Data
■ Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
■ street < city < state < country
■ Specification of a portion of a hierarchy for a set of values
by explicit data grouping
■ “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and
■ “{British Columbia, prairies Canada} ⊂ Western Canada”
■ Specification of a set of attributes, but not of their partial
ordering: A user may specify a set of attributes forming a
concept hierarchy, but omit to explicitly state their partial
ordering. The system can then try to automatically
generate the attribute ordering so as to construct a
meaningful concept hierarchy
183
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year
■ Example of ordering by distinct-value count:
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
184
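A minimal Python sketch of this heuristic, ordering attributes by their number of distinct values (pandas is assumed; the tiny location table is made up for the example):

import pandas as pd

locations = pd.DataFrame({
    "street": ["1 Main St", "2 Main St", "5 Oak Ave", "9 Elm Rd", "7 Pine Ct"],
    "city": ["Vancouver", "Vancouver", "Calgary", "Toronto", "Ottawa"],
    "province_or_state": ["British Columbia", "British Columbia",
                          "Alberta", "Ontario", "Ontario"],
    "country": ["Canada", "Canada", "Canada", "Canada", "Canada"],
})

# Heuristic: the fewer distinct values an attribute has, the higher it sits
# in the concept hierarchy (country above province_or_state above city ...).
distinct_counts = locations.nunique().sort_values()
hierarchy = " < ".join(reversed(distinct_counts.index.tolist()))

print(distinct_counts.to_dict())
print("generated hierarchy:", hierarchy)   # street < city < province_or_state < country

The exception noted above (e.g., weekday has only seven distinct values but belongs below month and year) is why an automatically generated ordering should still be reviewed by the user.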
Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
185
Summary
■ Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
■ Data cleaning: e.g., missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem
■ Remove redundancies
■ Detect inconsistencies
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
186
Links
• https://www.geeksforgeeks.org/discretization-by-histogram-analysis-in-
data-mining/
• https://zebrabi.com/advanced-guide/how-to-create-a-histogram-in-
power-
bi/#:~:text=Step%2Dby%2DStep%20Guide%20to%20Creating%20a%2
0Histogram%20in%20Power%20BI,-
Now%20that%20your&text=Click%20on%20the%20%22Visualizations
%22%20pane,field%20of%20the%20histogram%20visualization.
• https://youtu.be/cz1-SITC5vE
• https://www.javatpoint.com/discretization-in-data-mining